Datacoves blog

Learn more about dbt Core, ELT processes, DataOps,
modern data stacks, and team alignment by exploring our blog.
5 mins read

Getting Started with dbt: What to Learn First 

To get started with dbt, learn the command line and Git, how dbt connects to your warehouse, how it organizes data into layers, and the modeling concepts behind your tables. 

You don't have to master any of it first. You just need to know it exists and why it matters. 

Why dbt instead of stored procedures or a custom framework 

Plenty of teams still run their transformations as stored procedures or a homegrown framework someone wrote years ago. The logic grows into thousands of lines spread across procedures and scripts, nobody can see how one piece connects to the next, and a change in one place quietly breaks something three steps downstream. You find out when a stakeholder asks why their numbers look wrong. When the person who built it leaves, the knowledge walks out with them. 

dbt brings software engineering discipline to your transformations. Every model is version-controlled SQL the whole team can read. Dependencies are explicit through ref() and source(), so you get lineage for free instead of reverse-engineering it. Tests run automatically, and changes ship through pull requests like any other code. That's the shift analysts make when they move into analytics engineering, and it's why dbt is worth learning. 

dbt replaces stored procedures and custom frameworks with version-controlled SQL, automatic lineage, and tests that run in CI. 

| | Stored Procedures / Custom Framework | dbt |
| Version control | Rare; logic lives inside the database | Every model is versioned in Git |
| Lineage | Manual to trace, often undocumented | Built automatically from ref() and source() |
| Testing | Ad hoc, if it happens at all | Generic and custom tests run in CI |
| Code review | Hard; changes often go straight to the database | Pull requests, like any other code |
| Onboarding | Tribal knowledge, slow to transfer | A readable project a new hire can follow |
| Ownership | Unclear; breaks surface downstream | Explicit dependencies and clear owners |
| Lock-in | High; code is tied to one warehouse | Low; portable SQL across most warehouses |

Stored procedures and custom frameworks compared with dbt.

Here's the short list, roughly in the order it tends to come up.

Get your environment set up

Command line basics. dbt runs from the terminal, so get comfortable there before anything else. You'll spend your day with cd, ls, mkdir, and the dbt commands themselves. If the terminal is new to you, Terminal Tutor builds the muscle memory quickly, and the dbt cheat sheet covers the dbt commands and selectors you'll reach for most. 

Git fundamentals. Git is its own skill, separate from the command line, and it's how real dbt work ships. Learn clone, add, commit, push, and pull first, then move on to branching and pull requests, since that's how every team reviews and merges changes. The official Git tutorials and this walkthrough will get you there. 

Profiles. Your profile tells dbt how to connect to your warehouse. If it's misconfigured, nothing runs, so it's worth getting right early and then you can mostly forget about it.  

Keep profiles.yml in ~/.dbt, not in your project folder, and never commit it, since it holds your credentials (dbt docs on connection profiles). Datacoves and dbt Cloud both handle this part for you. The dbt terminology post breaks down how targets and outputs work inside the file.

Learn how dbt organizes and references data 

Data layers. Most dbt projects organize models into layers: raw, staging, core, and marts. You'll also hear bronze, silver, and gold. At Datacoves we use inlets, bays, and coves. They all map to the same idea. Raw lands untouched so you can trace origin and keep history. Staging does light cleanup like casting, aliasing, and flattening. Core is where reusable facts and dimensions get built. Marts are the business-facing models people actually query. Treat the names as an interchangeable convention. The part that affects your project is consistency: agree as a team on where facts and dims live, what each layer is responsible for, and which tests belong at each level. Our inlets, bays, and coves guide shows one way to define that. 

Data layer names like staging, silver, and inlets are interchangeable conventions; the thing that affects your project is whether your team uses them consistently.
Data organization layers

Ref vs source. This distinction tells you whether you actually understand lineage.  

Use source() to reference raw tables loaded outside dbt and ref() to reference other dbt models. 

Get them right and dbt builds your dependency graph for you. The dbt Jinja functions cheat sheet has the syntax. 
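
To make the lineage idea concrete, here is a minimal Python sketch of how a dependency graph can be derived from ref() and source() calls. It is not dbt's implementation: the model names, the regular expressions, and the `source:` prefix are all hypothetical, and real dbt parses Jinja rather than matching text.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model SQL keyed by model name (not from a real project).
MODELS = {
    "stg_orders": "select * from {{ source('shop', 'raw_orders') }}",
    "stg_customers": "select * from {{ source('shop', 'raw_customers') }}",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
}

REF_PATTERN = re.compile(r"ref\(\s*['\"](\w+)['\"]\s*\)")
SOURCE_PATTERN = re.compile(
    r"source\(\s*['\"](\w+)['\"]\s*,\s*['\"](\w+)['\"]\s*\)"
)


def find_dependencies(sql: str) -> set[str]:
    """Return the upstream nodes named in ref() and source() calls."""
    deps = set(REF_PATTERN.findall(sql))
    deps |= {f"source:{s}.{t}" for s, t in SOURCE_PATTERN.findall(sql)}
    return deps


def build_order(models: dict[str, str]) -> list[str]:
    """Order nodes so every upstream node comes before its dependents."""
    graph = {name: find_dependencies(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())
```

Calling `build_order(MODELS)` places both staging models and their sources ahead of `fct_orders`, which is the ordering a correct ref() and source() setup makes possible.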

Understand the modeling underneath your tables 

Data modeling concepts. Before you write a model, learn the basics of dimensional modeling. Start with Kimball and the star schema, since it's still the most widely used approach (Kimball Group's dimensional modeling techniques). Primary keys, foreign keys, composite keys, and surrogate keys all live here, and they're what keep your data trustworthy once it's joined. This video is a quick tour of the main modeling approaches, and dbt-utils gives you generate_surrogate_key so you don't hand-roll them (see the dbt-utils cheat sheet). 
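
As a rough illustration of what a surrogate key is, the Python sketch below hashes a row's business-key columns into a single stable identifier. It shows the general idea only; the placeholder for missing values and the separator are assumptions, and the exact SQL that generate_surrogate_key produces may differ by package version.

```python
import hashlib

NULL_PLACEHOLDER = "_null_"  # hypothetical marker for missing values


def surrogate_key(*values: object) -> str:
    """Hash the given column values into one deterministic key."""
    parts = [NULL_PLACEHOLDER if v is None else str(v) for v in values]
    return hashlib.md5("-".join(parts).encode("utf-8")).hexdigest()
```

The same inputs always give the same key, so `surrogate_key(1, "US")` can be recomputed on every run and still join correctly, while a missing value hashes differently from an empty string.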

The building blocks you'll use every day 

Macros. Macros are reusable pieces of SQL logic, written with Jinja. Once they click, you'll wonder how you wrote SQL without them. Keep them in folders organized by purpose instead of dumping everything into one file, or your macros directory turns into a junk drawer fast. The dbt Jinja cheat sheet covers the syntax. 

Seeds. Seeds are small static CSVs that dbt loads into your warehouse. Use them for cross-reference and lookup files, things like country codes or account mappings, not for loading real data. Keep them stable, since changing a seed means a code release. When the file's columns change, reload it with dbt seed --full-refresh. There's more in the dbt terminology post.

YML files. YML files hold your documentation, tests, and config. Don't pile everything into one giant sources.yml or models.yml. Use one file per source or model folder, prefixed with an underscore like _google_analytics.yml, so it sorts to the top of the folder and lives right next to what it describes. Small, focused files are far easier to find and review. 

dbt tests. Tests catch problems before they reach a dashboard. dbt ships with generic tests like unique and not_null, you can write your own singular tests as plain SQL, and packages like dbt-expectations and dbt-utils add more. Start simple and add coverage where it hurts. Our overview of dbt testing options lays out when to use each. 

Snapshots. Snapshots capture how a record changed over time, which is dbt's answer to slowly changing dimensions. You'll want this the first time someone asks what a customer's status was three months ago. The dbt docs on snapshots are the place to start. 
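
The Python sketch below shows the bookkeeping behind that kind of history, in the style of a type 2 slowly changing dimension. It is an illustration, not dbt's snapshot logic: the class, function names, and sample statuses are hypothetical, and dbt itself records validity in columns named dbt_valid_from and dbt_valid_to.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SnapshotRow:
    customer_id: int
    status: str
    valid_from: date
    valid_to: Optional[date] = None  # None means this is the current version


def apply_snapshot(history: list[SnapshotRow], customer_id: int,
                   status: str, as_of: date) -> None:
    """Close the current row and append a new one when the status changed."""
    current = next(
        (r for r in history
         if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None and current.status == status:
        return  # nothing changed, keep the current version open
    if current is not None:
        current.valid_to = as_of
    history.append(SnapshotRow(customer_id, status, as_of))


def status_on(history: list[SnapshotRow], customer_id: int,
              day: date) -> Optional[str]:
    """Look up which status was valid on a given day."""
    for r in history:
        if r.customer_id == customer_id and r.valid_from <= day and (
            r.valid_to is None or day < r.valid_to
        ):
            return r.status
    return None
```

With a history like this, "what was the status three months ago" becomes a lookup by date instead of a guess.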

Incremental models. When full refreshes start hurting, incremental models are what save you. Instead of rebuilding a table from scratch on every run, dbt only processes new or changed rows. You won't need this on day one, but you'll be glad you know it exists when a table gets big. The analytics glossary has a quick definition. 
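
A minimal Python sketch of the idea, assuming rows carry a hypothetical loaded_at value that only increases. In dbt the equivalent filter lives in the model's SQL, typically inside an is_incremental() block; this sketch only mirrors the "process what is new" logic.

```python
def incremental_run(target: list[dict], source: list[dict]) -> int:
    """Append only source rows newer than the latest row already in target."""
    high_water_mark = max((row["loaded_at"] for row in target), default=None)
    new_rows = [
        row for row in source
        if high_water_mark is None or row["loaded_at"] > high_water_mark
    ]
    target.extend(new_rows)
    return len(new_rows)
```

The first run against an empty target behaves like a full load; later runs touch only rows past the high-water mark.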

Where to learn more

A few resources worth bookmarking as you go: 

  • DataGym.io, hands-on dbt practice built by Bruno Lima. Follow Bruno on LinkedIn too, he shares some of the best practical dbt tips around. 
  • dbt Libraries, a curated set of packages and adapters worth knowing about. 
Getting started with dbt

Going from zero to a mature data stack 

You'll pick up most of this on the job. The point of the list is that when these concepts show up, and they will, you'll recognize them instead of getting stuck. That's the individual side of dbt. 

Running dbt across a team is a bigger job. Someone has to settle your Git repository structure and branching strategy, decide how many environments and schemas you need, set dbt development standards and approve the packages and macros people can use, stand up CI/CD and Slim CI, write the documentation and SQL linting rules, build the Airflow orchestration, and handle PII masking and warehouse roles. Those choices are the foundation a mature stack sits on, and reversing them later is painful. 

Datacoves manages the platform and sets up the foundational DataOps and security decisions, so teams go from zero to a mature data stack in weeks. 

That's where Datacoves comes in, on both fronts. We manage the platform, so your team isn't maintaining dbt, Airflow, and CI/CD on Kubernetes. And our Data Architecture Foundation engagement sets those decisions up with you, from Git and environment design to CI/CD, documentation and linting standards, orchestration, and PII handling, so a team can go from zero to a mature stack in weeks. If that's where you're headed, a free architecture review is a good place to start. 

Getting started with dbt
5 mins read

When you are learning to use a new tool or technology, one of the hardest things is learning all the new terminology. As we pick up language throughout our lives, we develop an association between words and our mental model of what they represent. The next time we see the word, that picture pops up in our head, and if the word is now being used to mean something new, we must create a new mental model. In this post, we introduce some core dbt (data build tool) terminology and how it all connects.

Language understanding is interesting in that once we have a mental model of a term, we have a hard time grasping a new association. I still remember the first time I spoke to someone about the Snowflake Data Warehouse, and they used the term warehouse. To me, the term had two mental models. One was a place where we store a lot of physical goods. Type Costco Warehouse into Google and the first result is Costco Wholesale, a large retailer in the US that is so big it is literally a warehouse full of goods.

I have also worked in manufacturing, so I also associated a warehouse with the place where raw materials and finished goods are stored.


In programming, we would say we are overloading the term warehouse to mean different things.

In some programming languages, function overloading or method overloading is the ability to create multiple functions of the same name with different implementations – Wikipedia

 We do this type of thing all the time and don’t think twice about it. However, if I say “I need a bass” do you know what I am talking about?

Bass Guitar

In my Snowflake example, I knew the context was technology and more specifically something to do with databases, so I already had a mental model for a warehouse. It’s even in Wikipedia's description of the company.

Snowflake Inc. is a cloud computing-based data warehousing company - Wikipedia

I knew of data warehouses from Teradata and Amazon (Redshift), so it was natural for me to think of a warehouse as a technology and a place where lots of data is stored. In my mind, I quickly thought of

  • The Redshift Warehouse
  • The Teradata Warehouse
  • The Snowflake Warehouse

For those new to the term warehouse, I may have lost you already. Maybe you are new to dbt and you come from the world of tools like Microsoft Excel, Alteryx, Tableau, and Power BI. If you know all this, grant me a few minutes to bring everyone up to speed.

Let’s step back and first define a database.

A database is an organized collection of structured information, or data, typically stored electronically in a computer system - Oracle

Ok, you probably know Excel. You have probably also seen an Excel Workbook with many sheets. If you organize your data neatly in Excel like the image below, we could consider that workbook a database.

Excel Sheet, a type of database

Going back to the definition above “organized collection of structured information” you can see that we have structured information, a list of orders with a Date, Order Quantity, and Order Amount. We also have a collection of these, namely Orders and Invoices.

In database terms, we call each Excel sheet a table and each of the columns an attribute.

Now back to a warehouse. This was my mental model of a warehouse.

A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise - Wikipedia

Again, if you are new to all this jargon, the above definition might not make much sense to you. Let’s go back to our Excel example. In an organization, you have many people with their own “databases” like the example above. Jane has one, Mario has another, Elena has a third. All have some valuable information we want to combine in order to make better decisions. So instead of keeping these Excel workbooks separately, we put them all together into a database and now we call that a warehouse. We use this central repository for our “business intelligence.”

So, knowing all of this, when I heard of a Snowflake warehouse the above is what I thought. It is the place where we have all the data, duh. Just like Redshift and Teradata. But look at what the people at Snowflake did: they changed the meaning on me.

A virtual warehouse, often referred to simply as a “warehouse”, is a cluster of compute resources in Snowflake. - Snowflake

The term warehouse here is no longer about the storage of things; it now means “cluster of compute.” A what of what?

Ok, let’s break this down. You are probably reading this on a laptop or some other mobile device. That device stores all your documents and when you perform some actions it “computes” certain things. Well, in Snowflake the storage of the information is separate and independent of the computation on the things that are stored. So, you can store things once and connect different “computers” to it. Imagine you were performing a task on your laptop, and it was slow. What if you could reach in your desk drawer, pull out a faster computer, and speed up the task that was slow? Well, in Snowflake you can. Also, instead of just having one computer doing the work, they have a cluster of computers working together to get the job done even faster.

As you can see, language is tricky, and creating a shared understanding of it is crucial to advancing your understanding and mastery of the technology. Every Snowflake user develops the new mental model for a warehouse and using it is second nature, but we forget that these terms that are now natural to us may still be confusing to newcomers.

Understanding dbt (data build tool) terminology

Let’s start with dbt. When you join the dbt Slack community you will inevitably learn that the preferred way to write dbt is all lower case. Not DBT, not Dbt, just dbt. I still don’t know why exactly, but you may have noticed that everyone in this space always puts “dbt (Data Build Tool).”

If you have some knowledge of behavioral therapy, you may already know that DBT has a different meaning: dialectical behavior therapy (DBT).

Dialectical behavioral therapy (DBT) is a type of cognitive-behavioral therapy. Cognitive-behavioral therapy tries to identify and change negative thinking patterns and pushes for positive behavioral changes.

Did you notice how they do the inverse? They spell out Dialectical behavior therapy and put DBT in parentheses. So, maybe the folks at Fishtown Analytics, now dbt Labs, came across this other meaning for DBT and chose to differentiate by using lowercase, or maybe it was to mess with all of the newbies lol.

So update your auto-correct and don’t let dbt become DBT or Dbt or you will hear from someone in the community, haha.

Now let’s do a quick rundown of terms you will hear in dbt land which may confuse you as you start your dbt journey. I will link to the documentation with more information. My job here is to hopefully create a good mental model for you, not to teach you all the ins and outs of all of these things.

Seed or dbt seed

This is simply some data that you put into a file and make part of your project. You put it in the seeds folder within your dbt project, but don’t use this as your source to populate your data warehouse; these are typically small files you may use as lookup tables. If you are using an older version of dbt, the folder would be data instead of seeds. That was another source of confusion, so now the term seed and the directory seeds are more tightly connected. The format of these files must be CSV; more information can be found in the dbt documentation on seeds.

Jinja

Jinja is a templating engine with syntax similar to the Python programming language that allows you to use special placeholders in your SQL code to make it dynamic. The stuff you see with {{ }} is Jinja.

Without Jinja, there is no dbt. I mean it is the combination of Jinja with SQL that gives us the power to do things that would otherwise be very difficult. So, when you see the lineage you get in the dbt documentation, you can thank Jinja for that.

Lineage graph generated by dbt leveraging the source and ref macros

dbt macro

I knew you would have this question. Well, a macro is simply a reusable piece of code. This too adds to the power of dbt. Every newcomer to dbt will quickly learn about the ref and source macros. These are the cornerstone of dbt. They help capture the relationship and sequence of all your data transformations. Sometimes you are using macros and you may not even realize it. Like the not_null test in your yml file, that’s a macro.

Not Null test in a yml file
Not Null test macro

Behind the scenes, dbt is taking information in your yml file and sending parameters to this macro. In my example, the parameter model gets replaced with base_cases (along with the database name and schema name) and column_name gets replaced with cases. The compiled version of this test looks like this:

Compiled dbt not null test
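
Since the screenshots are not reproduced here, the following Python sketch approximates what that substitution amounts to. It is not dbt's macro code: the function names and the database and schema names are hypothetical, while base_cases and cases come from the example above.

```python
def compile_not_null_test(relation: str, column_name: str) -> str:
    """Render a query that returns the rows violating a not-null expectation."""
    return (
        f"select {column_name}\n"
        f"from {relation}\n"
        f"where {column_name} is null"
    )


def is_passing(failing_rows: list) -> bool:
    """A test of this kind passes when the query returns no rows."""
    return len(failing_rows) == 0


# Hypothetical database and schema; model and column names are from the text.
sql = compile_not_null_test("my_database.my_schema.base_cases", "cases")
```

The key point is that a test is just a query for bad rows: an empty result means the test passes.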

There are dbt packages like dbt-expectations that extend the core dbt tests by adding a bunch of test macros, so check it out.

dbt package

What do you do when you have a lot of great macros that you want to share with others in the community? You create a dbt package of course.

But what is a dbt package? A package is simply a mini dbt project that can be incorporated into your dbt project via the packages.yml file.  There are a ton of great packages and the first one you will likely run into is dbt-utils. These are handy utilities that will make your life easier. Trust me, go see all the great things in the dbt-utils package.

Packages don’t just have macros though. Remember, they are mini dbt projects, so some packages incorporate some data transformations to help you do your analytics faster. If you and I both need to analyze the performance of our Google Ads, why should we both have to start from scratch?  Well, the fine folks over at Fivetran thought the same thing and created a Google Ads package to help.

When you run the command dbt deps, dbt will look at your packages.yml file and download the specified packages to the dbt_packages directory of your dbt project. If you are on an older version of dbt, packages will be downloaded to the dbt_modules directory instead, but again you can see how this could be confusing, hence the updated directory name.

There are many packages and new ones arrive regularly. You can see a full listing on dbt hub.

dbt hub

This is the website maintained by dbt Labs with a listing of dbt packages.  

As a side note, we at Datacoves also maintain a similar listing of Python libraries that enhance the dbt experience in our dbt Libraries page. Check out all the libraries that exist. From additional database adapters to tools that can extract data from your BI tool and connect it with dbt, there’s a wealth of great open-source projects that take dbt to another level. Keep in mind that you cannot install Python libraries on dbt Cloud.

dbt models

These are the SQL files you find in the models directory. These files specify how you want to transform your data. By default, each of these files creates a view in the database, but you can change the materialization of a model to something else and for example, have dbt create a table instead.

Materialization

Materializations define what dbt will do when it runs your models. Basically, when you execute dbt run, this is what happens.

  1. dbt reads all your files
  2. dbt then compiles the models by replacing the Jinja code with the “real” code the database will run, e.g. {{ ref("my_model") }} becomes my_database.my_schema.my_model
  3. Finally, it wraps the compiled code in the specified materialization, which by default is a view
Original dbt model you create
Compiled model dbt produces. Notice how line 3 was changed to a specific database object
The code that will actually run in the database is the compiled model code wrapped in the materialization, in this case, a create or replace view statement.
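
The compile-then-wrap sequence can be sketched in a few lines of Python. This is an illustration under stated assumptions, not dbt's compiler: it handles only simple ref() calls via a regular expression, and the function names and the model name final are hypothetical.

```python
import re

REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")


def compile_model(sql: str, database: str, schema: str) -> str:
    """Replace each {{ ref('name') }} with a fully qualified relation name."""
    return REF_PATTERN.sub(
        lambda m: f"{database}.{schema}.{m.group(1)}", sql
    )


def materialize_as_view(name: str, compiled_sql: str,
                        database: str, schema: str) -> str:
    """Wrap compiled SQL in the statement that creates the view."""
    return (
        f"create or replace view {database}.{schema}.{name} as (\n"
        f"{compiled_sql}\n);"
    )


compiled = compile_model(
    'select * from {{ ref("my_model") }}', "my_database", "my_schema"
)
statement = materialize_as_view("final", compiled, "my_database", "my_schema")
```

Here `compiled` corresponds to what lands in the compiled directory and `statement` to what lands in the run directory.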

All the code that dbt compiles and runs can be found in the dbt target directory.

Target

This term can be ambiguous to a new dbt user. This is because in dbt we use it interchangeably to mean two different things. As I used it above, I meant the directory within your dbt project where dbt commands write their output. If you look in this directory, you will see the compiled and run directories where I found the code I showed above.

dbt target directory

Now that you know what dbt is doing under the hood, you can look in this directory to see what will be executed in the database. When you need to do some debugging, you should be able to take code directly from the compiled directory and run it on your database.

dbt target

This is the other meaning for target. It refers to where dbt will create/materialize the objects in your database.

Again, dbt first compiles your model code and creates the files in the compiled directory. It then wraps the compiled code with the specified materialization and saves the resulting code in the run directory. Finally, it executes that code in your database target. It is the final file in the run directory that is executed in your database.

Code in the run directory is sent to your database

The image above is the code that runs in my Snowflake instance.

But how does dbt know which database target to use? You told it when you set up your dbt profile which is normally stored in a folder called .dbt in your computer's home folder (dbt Cloud and Datacoves both abstract this complexity for you).

dbt Profile

When you start using dbt, you learn of a file called profiles.yml. This file has your connection information to the database and should be kept secret as it typically contains your username and password.

This file is called profiles, plural, because you can have more than one profile which you eventually realize is where the target database is defined.  Here is a case where you can argue that a better name for this file is targets.yml, but you will learn later why the name profiles.yml was probably chosen and why this name makes sense.

Two targets defined in profiles.yml (database connection details collapsed for brevity)

Notice above that I have two different dbt targets defined below the word outputs, dev and prd.  dbt can only work on one target at a time so if you want to run dbt against two different databases you can specify them here. Just copy the dev target, give it a new name, and change some of the parameters.

Think of the word outputs on line 3 above as targets. Notice on line 2 the entry target: dev, which tells dbt which target it should use as your default. In my case, unless I specify otherwise, dbt will use the dev target as my default connection. Hence it will replace the Jinja ref macro with my development database.

Line 3 shows what the ref macro gets replaced with using the default target in the profiles.yml file when compiling this model

How would you use the other target? You simply pass the target parameter to the dbt command like

dbt run --target prd or dbt run -t prd
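
The selection rule is simple enough to sketch in Python. The dictionary below is a hypothetical stand-in for one parsed profile (the database and schema values are placeholders, not the author's settings), and pick_output is an illustrative helper rather than anything dbt exposes.

```python
from typing import Optional

# Hypothetical stand-in for one parsed profile; values are placeholders.
PROFILE = {
    "target": "dev",
    "outputs": {
        "dev": {"database": "analytics_dev", "schema": "dbt_me"},
        "prd": {"database": "analytics", "schema": "core"},
    },
}


def pick_output(profile: dict, cli_target: Optional[str] = None) -> dict:
    """Use the --target value when given, otherwise the default target."""
    target_name = cli_target or profile["target"]
    if target_name not in profile["outputs"]:
        raise ValueError(f"target '{target_name}' is not defined under outputs")
    return profile["outputs"][target_name]
```

`pick_output(PROFILE)` returns the dev settings, while `pick_output(PROFILE, "prd")` mirrors passing --target prd on the command line.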

What is that default: thing on the first line of my profiles.yml file?

My profiles.yml starts with the word default

Well you see, that’s the name given to your dbt profile, which by default is well, default.

dbt project

The dbt project is what is created when you create a project via the dbt init command. It includes all of the folders you typically associate with a dbt project and includes a configuration file called dbt_project.yml. If you look at your dbt_project.yml file, you will find something similar to this.

Line 10 shows which profile dbt will use from within your profiles.yml file

In line 10 you can see which profile dbt will look for in your profiles.yml file. If I change that line and try to run dbt, I will get an error.

New profile name that does not match what is in my profiles.yml file
dbt run fails because it didn't find the company_a profile in my profiles.yml file
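
The mismatch described above comes down to a key lookup. In this Python sketch the two dictionaries are hypothetical stand-ins for the parsed files, showing only the keys discussed in the text, and the error wording is illustrative rather than dbt's exact message.

```python
# Hypothetical parsed files; only the keys discussed in the text are shown.
PROFILES_YML = {"default": {"target": "dev", "outputs": {"dev": {}, "prd": {}}}}
DBT_PROJECT_YML = {"name": "my_project", "profile": "company_a"}


def find_profile(project: dict, profiles: dict) -> dict:
    """Return the profile the project asks for, or fail if it is missing."""
    wanted = project["profile"]
    if wanted not in profiles:
        # Illustrative wording, not dbt's exact error text.
        raise LookupError(f"no profile named '{wanted}' in profiles.yml")
    return profiles[wanted]
```

With only a default profile defined, `find_profile(DBT_PROJECT_YML, PROFILES_YML)` raises; adding a company_a entry to the profiles makes the same call succeed.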

NOTE: For those paying close attention, you may have seen I used -s and not -m when selecting a specific model to run. This is the new/preferred way to select what dbt will run.

So now you see why profiles.yml is called profiles.yml and not targets.yml, because you can have multiple profiles in the file. In practice, I think people normally only have one profile, but nothing is preventing you from creating more and it might be handy if you have multiple dbt projects each with different connection information.

Those smart folks at Fishtown Analytics built in this flexibility for a very specific use case. You see, they were originally an analytics consulting company and developed dbt to help them do their work more efficiently. You can imagine that they were working with multiple clients whose project timelines overlapped, so by having multiple profiles they could point each independent dbt project to a different profile in the profiles.yml file with each client's database connection information. Something like this.

profiles.yml with three profiles; default, company_a, and company_b

Now that I have a profile called company_a in my profiles.yml that matches what I defined in my dbt_project.yml, dbt will run correctly.

dbt_project.yml pointing to a profile called company_a
dbt run can now find a profile named company_a so it knows what database connection to use

Conclusion

There is a ton of stuff to learn in your dbt journey and starting out with a solid foundation can help you better communicate and quickly progress through the learning curve.

Fishtown Analytics, now dbt Labs, created dbt to meet a real need they had and some of their shared vocabularies made it into the names we now use in the community. Those of us who have made it past the initial learning curve sometimes forget how daunting all the terminology can be for a newcomer.

There is a wealth of information you can find in the dbt documentation and our own dbt cheat sheet, but it takes some time to get used to all the new terms and understand how it's all connected. So next time you come across a newbie, think about the term that you are about to use and the mental model they will have when you tell them to update the seed. We need to take our new dbt seeds (people) and mature them into strong trees.

Seedling on a hand

Thumbnail introducing snowcap
5 mins read

Managing Snowflake infrastructure is harder than it should be. The tools that exist either weren’t built for Snowflake specifically, cover only part of what you need to manage, or come with a permission model so opinionated it stops fitting your org somewhere around the second business unit. Most teams end up with a mix: one tool for grants, a script for provisioning, manual clicks for everything else. It works for a while, but doesn't scale well.

Snowcap is an open-source tool built to manage the whole thing in one place.

If you prefer a quick walkthrough, watch the video below.

What Snowcap Is


Snowcap is a Snowflake-native infrastructure as code tool, open source and maintained by Datacoves. You define your Snowflake resources in YAML or Python, run snowcap plan to see what would change, and run snowcap apply to make it happen. That’s the loop.

Pro-Tip: if you are using Snowcap against an existing Snowflake account, DO NOT sync resources, as that will override anything in your account with what is defined in your Snowcap configs. We also advise that you always run snowcap plan and validate all Snowflake changes before applying them.

There’s no state file. Instead of tracking your infrastructure in a local or remote file that drifts the moment someone makes a change outside the tool, Snowcap queries Snowflake directly on every run. It compares your config against what’s actually in the account and generates the SQL to make them match. If an admin creates a warehouse through the UI, Snowcap knows about it the next time you run. Nothing to reconcile. 
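
To illustrate the stateless approach in general terms, here is a small Python sketch that diffs a desired config against the live state and emits SQL. It is not Snowcap's code: the plan function and the dictionary shapes are hypothetical, only warehouses and their size are handled, and in a real tool the actual state would come from querying Snowflake on each run.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[str]:
    """Diff desired warehouses against live ones and emit SQL to reconcile."""
    statements = []
    for name, props in desired.items():
        size = props["warehouse_size"]
        if name not in actual:
            statements.append(
                f"CREATE WAREHOUSE {name} WAREHOUSE_SIZE = '{size}'"
            )
        elif actual[name]["warehouse_size"] != size:
            statements.append(
                f"ALTER WAREHOUSE {name} SET WAREHOUSE_SIZE = '{size}'"
            )
    return statements


# Hypothetical config and live state, for illustration only.
desired = {
    "LOADING": {"warehouse_size": "XSMALL"},
    "TRANSFORMING": {"warehouse_size": "MEDIUM"},
}
actual = {
    "TRANSFORMING": {"warehouse_size": "SMALL"},
    "ADHOC": {"warehouse_size": "XSMALL"},
}
changes = plan(desired, actual)
```

Because the live state is read fresh each time, a warehouse created through the UI (ADHOC here) simply shows up in `actual`; this sketch leaves such unmanaged objects alone rather than dropping them.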

What It Covers

Most IaC tools for Snowflake manage one slice of your account. Snowcap manages the full surface area: warehouses, databases, schemas, roles, users, grants, dynamic tables, hybrid tables, masking policies, row access policies, stages, pipes, streams, tasks, stored procedures, UDFs, integrations, and more. Over 60 resource types in total. 

Configuration scales with your needs. For simple setups, a single YAML file is enough. For complex environments with dozens of business units or regional role variants, Snowcap supports templating: define a pattern once and apply it across a list. What used to mean copy-pasting a config block forty times becomes a single definition. 
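
The "define once, apply across a list" idea can be sketched in Python as follows. This is not Snowcap's templating syntax; the template keys, the naming pattern, and the expand_template helper are all hypothetical.

```python
def expand_template(template: dict, units: list[str]) -> list[dict]:
    """Stamp out one role definition per business unit from a single pattern."""
    return [
        {key: value.format(unit=unit) for key, value in template.items()}
        for unit in units
    ]


# Hypothetical role pattern and business units.
ROLE_TEMPLATE = {"name": "{unit}_ANALYST", "comment": "Read access for {unit}"}
roles = expand_template(ROLE_TEMPLATE, ["SALES", "FINANCE", "MARKETING"])
```

Adding a business unit then means adding one entry to the list rather than copying a config block.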

| Category | Examples |
| Warehouses & Compute | Warehouses, resource monitors, compute pools |
| Access Control | Roles, users, grants, database roles |
| Database Objects | Databases, schemas, tables, views, dynamic tables |
| Security & Policies | Masking policies, row access policies, network rules, secrets |
| Data Loading & Sharing | Stages, pipes, streams, shares, replication groups |
| Functions & Orchestration | Tasks, alerts, UDFs, stored procedures |
| Integrations & Apps | Storage, catalog, security integrations, Streamlit, Snowpark |

What’s Next

Snowcap is available now. The full documentation is at snowcap.datacoves.com. If you have an existing Snowflake account and want to see what your current setup would look like as config, snowcap export generates it for you, but this may need optimization if you have a Snowflake account with a lot of objects.

The next post in this series covers installation, authentication, and your first real config from scratch. 

Snowcap is how we manage Datacoves deployments. To see how it fits into the full stack, visit datacoves.com/snowflake

Getting started with dbt and snowflake
5 mins read

Setting up dbt with Snowflake takes four steps: install the dbt-snowflake adapter with pip, configure a Snowflake user with key pair authentication, set up profiles.yml, and verify the connection with dbt debug.

From there, add a few packages (dbt-coves, dbt_constraints, dbt_semantic_view), install SQLFluff and the right VS Code extensions, and you're ready to build. 

The full setup is straightforward for one developer. It gets expensive across a team, which is where managed dbt platforms come in. 

This guide walks through each step, the tooling that's worth adding, and when it makes sense to stop maintaining the setup yourself. 

What You Need to Run dbt with Snowflake 

Before you can run dbt against Snowflake, you need three things on your machine and one thing in Snowflake: 

On your machine: 

  • Python 3.9 or later. The dbt-snowflake adapter no longer supports older versions. Python 3.11 or 3.12 is a good default.
  • Git. Required for dbt projects, version control, and CI/CD. If you don't already have it, follow GitHub's setup guide.
  • VS Code. The editor this guide assumes for dbt development and for the extensions covered later.

In Snowflake: 

  • An account where you can create roles, databases, and warehouses, or admin support to do it for you. Do not use ACCOUNTADMIN for day-to-day dbt work. 

That's the short list. The next sections walk through each piece. 

Install Python, Git, and the dbt-snowflake Adapter 

Once Python, Git, and VS Code are installed, the only thing left to install locally is the dbt adapter for Snowflake. 

Use a virtual environment 

Install dbt inside a virtual environment, not against your system Python. A venv keeps your dbt dependencies isolated from other Python projects and makes upgrades safe: 

python -m venv .venv 
source .venv/bin/activate    # macOS/Linux 
.venv\Scripts\activate       # Windows 

Activate the venv every time you work on the project. Tools like uv or pyenv are also worth looking at if you're managing multiple Python versions across projects.
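If you are unsure whether the venv is active, a short Python check can tell you. This is a minimal sketch using only the standard library; the function names are made up for the example, and the 3.9 minimum mirrors the requirement stated earlier in this guide.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter is running from a virtual environment."""
    # Inside a venv, sys.prefix points at the venv and sys.base_prefix
    # points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


def python_is_supported(minimum=(3, 9)) -> bool:
    """True when the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


print("virtual environment active:", in_virtualenv())
print("Python version OK:", python_is_supported())
```

Run it with the same python command you plan to use for pip install. If the first line prints False, activate the venv before installing anything.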

Install dbt-snowflake

Open a terminal and run:

pip install dbt-snowflake 

This installs dbt-core and the Snowflake adapter together. The adapter version pins a compatible dbt-core, so in most cases you don't need to specify versions yourself. 

If you need a specific version for a project that's pinned to an older release, install it explicitly: 

pip install dbt-snowflake==<version number> 

Confirm the install worked:

dbt --version 

You should see both dbt-core and dbt-snowflake listed.
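The same check can be scripted, which is handy in a setup script or CI job. The sketch below uses the standard library's importlib.metadata to look up installed distributions; installed_version is a helper name invented for this example.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("dbt-core", "dbt-snowflake"):
    print(name, installed_version(name) or "not installed")
```

It reports on the environment the script runs in, so run it from the activated venv.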

Configure Your Snowflake Account for dbt  

Before dbt can connect to Snowflake, you need a Snowflake user with the right permissions, a role for that user to assume, a database where dbt can build models, and a warehouse for dbt to use as compute. You also need an authentication method. As of late 2025, that means key pair authentication, not a password. 

Create the Role, Database, and Warehouse  

For a typical dbt setup, create a dedicated role, database, and warehouse rather than reusing existing ones. This keeps dbt's footprint isolated and easy to govern. 

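Run the following with roles that hold the required privileges. By default, SYSADMIN can create warehouses and databases, while USERADMIN or SECURITYADMIN can create roles and manage grants. Avoid ACCOUNTADMIN for day-to-day work: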

-- Create a warehouse for dbt compute 
create warehouse transforming 
  warehouse_size = 'xsmall' 
  auto_suspend = 60 
  auto_resume = true 
  initially_suspended = true; 
 
-- Create a database where dbt will build models in development 
create database analytics_dev; 
 
-- Create a role for dbt developers 
create role analyst; 
 
-- Grant ownership of the dev database to the role 
grant ownership on database analytics_dev to role analyst; 
 
-- Grant warehouse usage to the role 
grant usage on warehouse transforming to role analyst; 
 
-- Grant the role to your user 
grant role analyst to user your_username; 

When dbt runs, it creates a schema for each developer inside analytics_dev and uses the transforming warehouse for compute. Production deployments typically use a separate role, database, and warehouse, governed through CI/CD rather than developer accounts. 

For a more comprehensive Snowflake permission model (read-only roles, environment-specific access, masking policies, RBAC at scale), see How to Configure Snowflake for dbt on the dbt blog. We'll also cover infrastructure-as-code options for managing this further down. 

Set Up Key Pair Authentication 

Key pair authentication is the correct default for connecting dbt to Snowflake. As of November 2025, Snowflake enforces MFA on username/password logins, which makes password authentication unworkable for any unattended dbt run. 

Step 1. Generate a key pair on your machine.

# Generate an unencrypted private key 
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt 
 
# Generate the matching public key 
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub 

Windows users: install OpenSSL via Git for Windows (which bundles it). 

For production or CI/CD environments, store the private key in a secrets manager rather than on developer machines. 

Step 2. Register the public key with your Snowflake user. 

In Snowflake, run:

alter user your_username set rsa_public_key='<paste the contents of rsa_key.pub here, without the BEGIN/END lines>'; 
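Stripping the BEGIN/END lines by hand is easy to get wrong, so here is a small Python sketch that does it and builds the statement above. The helper names are hypothetical, the sample key is placeholder text rather than a real key, and joining the body into one line is a convenience choice for pasting.

```python
def public_key_body(pem_text: str) -> str:
    """Return the base64 body of a PEM public key as a single line."""
    lines = [line.strip() for line in pem_text.splitlines()]
    # Drop the BEGIN/END delimiter lines and any blank lines.
    body = [line for line in lines if line and not line.startswith("-----")]
    return "".join(body)


def alter_user_statement(username: str, pem_text: str) -> str:
    """Build the ALTER USER statement that registers the public key."""
    return f"alter user {username} set rsa_public_key='{public_key_body(pem_text)}';"


# Placeholder text standing in for the contents of rsa_key.pub.
example_pem = """-----BEGIN PUBLIC KEY-----
AAAAEXAMPLELINE1
BBBBEXAMPLELINE2
-----END PUBLIC KEY-----
"""

print(alter_user_statement("your_username", example_pem))
```

In practice you would read rsa_key.pub from disk and paste the printed statement into a Snowflake worksheet.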

Step 3. Reference the private key from profiles.yml.

dbt supports either a path to the private key file or the key contents inline. We'll set this up in the next section. 

For SSO environments where browser-based authentication is acceptable for local development, externalbrowser is also supported, but it can't be used for unattended runs. For most teams, key pair auth is the consistent answer across local development, CI, and production.

Method | Local Dev | CI/CD and Production | Notes
Key pair (recommended) | Yes | Yes | Consistent across all environments. Works without MFA prompts.
External browser (SSO) | Yes | No | Useful for ad-hoc development. Cannot run unattended.
Username and password | Limited | No | Snowflake now enforces MFA on these logins. Effectively dead for dbt.
OAuth | Yes | Yes (with extra setup) | Strong option for teams already using OAuth with their IdP.

Configure dbt and Verify the Snowflake Connection 

With Snowflake configured, the next step is to point dbt at it. dbt reads connection details from a file called profiles.yml, which lives in your home directory at ~/.dbt/profiles.yml. Project-level Snowflake behavior (table types, query tags, warehouse overrides) lives in dbt_project.yml inside the project itself. 

Initialize the Project with dbt init

If you're starting from scratch, dbt init creates a new project and prompts you for connection details:

dbt init my_project 

If you're cloning an existing project, run dbt init from inside the cloned repo to set up your profiles.yml entry without overwriting the project files. 

The init flow asks for the database type, account identifier, user, authentication method, role, database, warehouse, schema, and threads. The result is a working profiles.yml entry that looks like this: 

my_project: 
  target: dev 
  outputs: 
    dev: 
      type: snowflake 
      account: abc12345.us-east-1 
      user: your_username 
      private_key_path: /Users/your_username/.snowflake/rsa_key.p8 
      role: analyst 
      database: analytics_dev 
      warehouse: transforming 
      schema: dbt_your_username 
      threads: 8 

A few notes: 

  • private_key_path points to wherever you saved the private key you generated. Use the absolute path. The ~/ shorthand isn't always reliable in profiles.yml. 
  • schema is the developer's personal schema. The convention dbt_<username> prevents developers from stepping on each other. 
  • threads controls how many models dbt builds in parallel. 8 is a reasonable starting point. 
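Two of the values above are easy to derive rather than type. The sketch below expands ~ into an absolute path for private_key_path and builds a schema name in the dbt_<username> style. Both helpers are hypothetical, and the way dev_schema normalizes usernames (lowercase, non-alphanumerics replaced with underscores) is an assumption for this example, not a dbt rule.

```python
import re
from pathlib import Path


def absolute_key_path(path_text: str) -> str:
    """Expand ~ and return an absolute path suitable for private_key_path."""
    return str(Path(path_text).expanduser().resolve())


def dev_schema(username: str) -> str:
    """Build a personal schema name following the dbt_<username> convention."""
    # Assumption: keep schema names to lowercase letters, digits, underscores.
    cleaned = re.sub(r"[^a-z0-9]+", "_", username.lower()).strip("_")
    return f"dbt_{cleaned}"


print(absolute_key_path("~/.snowflake/rsa_key.p8"))
print(dev_schema("Jane.Doe"))
```

The first printed line is what you would paste into profiles.yml in place of the ~/ form.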

If you maintain a project that other developers will clone, add a profile_template.yml at the project root. It pre-fills the fixed values (account, role, database, warehouse) and only prompts each developer for what's truly user-specific (their username, schema, threads). This saves real time across a team. 

Run dbt debug to Verify the Connection  

Before doing anything else, confirm dbt can connect to Snowflake:

dbt debug 

If everything is configured correctly, you'll see All checks passed! at the bottom of the output. If you get an error, the most common causes are: 

  • Wrong account identifier format (Snowflake account IDs vary by region and cloud). 
  • Public key not registered against the user, or registered with the BEGIN/END lines included. 
  • Role missing USAGE on the warehouse or OWNERSHIP on the database. 
  • Wrong private key path, or the key file has restrictive permissions Python can't read. 
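Some of these causes can be caught locally before dbt ever talks to Snowflake. The sketch below checks one profiles.yml target, passed in as a plain dict, for missing fields and key file problems. It assumes key pair authentication with private_key_path; the preflight function and its messages are invented for this example, and it cannot verify the account identifier or whether the public key is registered.

```python
import os
from pathlib import Path

REQUIRED_KEYS = ("account", "user", "role", "database", "warehouse", "schema")


def preflight(target: dict) -> list:
    """Return human-readable problems found in one profiles.yml target."""
    problems = [f"missing '{key}'" for key in REQUIRED_KEYS if not target.get(key)]
    key_path = target.get("private_key_path")
    if not key_path:
        problems.append("missing 'private_key_path'")
        return problems
    path = Path(key_path)
    if not path.is_absolute():
        problems.append("private_key_path should be an absolute path")
    elif not path.is_file():
        problems.append(f"no key file at {path}")
    elif not os.access(path, os.R_OK):
        problems.append(f"key file at {path} is not readable by this user")
    return problems


# Values copied from the example profile, with the ~/ shorthand left in.
dev_target = {
    "account": "abc12345.us-east-1",
    "user": "your_username",
    "role": "analyst",
    "database": "analytics_dev",
    "warehouse": "transforming",
    "schema": "dbt_your_username",
    "private_key_path": "~/.snowflake/rsa_key.p8",
}

for problem in preflight(dev_target):
    print(problem)
```

An empty result only means the local file looks sane. dbt debug is still the real test of the connection.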

If you're stuck, the #db-snowflake channel on the dbt Community Slack is the fastest way to get unstuck. 

Useful profiles.yml Settings to Know

dbt init gives you a working baseline, but a few profiles.yml settings are worth knowing about once you start running dbt regularly: 

  • reuse_connections: true keeps Snowflake connections alive across queries, which speeds up runs noticeably and is especially helpful with SSO. 
  • client_session_keep_alive: true prevents Snowflake from timing out long sessions during big builds. 
  • query_tag sets a default tag on every query dbt issues. This makes it easy to filter dbt activity in QUERY_HISTORY (we'll cover model-level overrides in the next section). 
  • connect_retries and connect_timeout are worth tuning if you hit transient connection failures. 

Full reference: dbt-snowflake profile configuration

Useful dbt_project.yml Settings for Snowflake

Where profiles.yml controls how dbt connects, dbt_project.yml controls how dbt builds against Snowflake. A few Snowflake-specific configs are worth knowing about: 

Transient tables. Snowflake transient tables skip Fail-safe storage, which reduces cost. dbt creates transient tables by default. To make a folder of models permanent (for example, models that need Time Travel beyond one day or Fail-safe protection): 

models: 
  my_project: 
    marts: 
      +transient: false 

Query tags at the model level. Set a default in profiles.yml and override per model or folder in dbt_project.yml:

models: 
  my_project: 
    finance: 
      +query_tag: "finance_models" 

Copy grants on rebuild. When dbt rebuilds a table, grants on the previous table are dropped by default. To preserve them:

models: 
  my_project: 
    +copy_grants: true 

Warehouse override. Most models can run on a small warehouse, but a few heavy ones may need more compute. Override per model or folder rather than running everything on a large warehouse:

models: 
  my_project: 
    heavy_marts: 
      +snowflake_warehouse: "transforming_xl" 

This also works for tests, which is useful when you want lightweight tests on a smaller warehouse than your model builds. 

The full list of Snowflake-specific configs lives in the dbt Snowflake configurations reference.

dbt Packages and Python Libraries Worth Adding

dbt is most useful when paired with the right packages and Python libraries. The list below isn't exhaustive, but each of these earns its place in a serious dbt-on-Snowflake project. 

Package / Library | What It Does | When to Add It
dbt-coves | CLI that generates staging models, source YAML, and property files from warehouse metadata | From day one. Saves hours of boilerplate on every project.
dbt_constraints | Turns dbt tests into actual Snowflake primary key, unique key, foreign key, and not-null constraints with RELY | Once you have a stable set of tests. Improves query optimization and data modeling tool support.
dbt_semantic_view | Adds a semantic_view materialization so Snowflake semantic views are managed in dbt | When you're using or planning to use Snowflake semantic views, Cortex Analyst, or Snowflake Intelligence.
SQLFluff | SQL linter that understands Jinja and dbt syntax | Before the codebase grows past a handful of models. Easier to start clean than retrofit.
dbt-checkpoint | Pre-commit hooks that validate model documentation, tests, and naming conventions | Before merging the first feature branch. Stops technical debt from compounding.

dbt-coves (Datacoves)

dbt-coves is an open-source CLI tool maintained by Datacoves. It automates the tedious parts of dbt development that nobody enjoys doing by hand: generating source definitions, staging models, property files, and Airflow DAGs from your warehouse metadata. 

Install it with pip:

pip install dbt-coves 

Most teams use it for staging model generation. Point it at a source schema and it produces clean staging models, source YAML, and the matching property files in seconds. For analytics engineers who model dozens of source tables, this saves hours per project. 

dbt-coves also includes utilities for backing up Airbyte and Fivetran configurations, which is useful when you want your ingestion config to live in Git alongside your dbt models. 

dbt_constraints (Snowflake Labs)

dbt_constraints is a Snowflake Labs package that turns your existing dbt tests into actual database constraints. If you've already added unique, not_null, and relationships tests, this package will generate matching primary key, unique key, foreign key, and not-null constraints on Snowflake automatically. 

Add it to packages.yml:

packages: 
  - package: Snowflake-Labs/dbt_constraints 
    version: [">=1.0.0", "<2.0.0"] 

Why bother, given that Snowflake doesn't enforce most constraints? 

  • Query performance. Snowflake's optimizer uses primary key, unique key, and foreign key constraints during query rewrite when they're set to RELY. dbt_constraints creates constraints with RELY automatically when the underlying test passes, and NORELY when it fails. The optimizer can use this for join elimination, which removes unnecessary tables from query plans. 
  • Data modeling tools. BI and modeling tools like DBeaver and Oracle SQL Developer Data Modeler can reverse-engineer accurate data model diagrams when constraints exist. Without constraints, those diagrams are guesswork. 
  • Documentation that's always in sync. The constraints in your warehouse match what dbt actually tests. There's no drift between "what the tests say" and "what the database knows." 
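To make the RELY/NORELY rule concrete, here is a simplified Python sketch of the mapping from a test result to a constraint statement. It is not the package's macro code and covers only primary and unique keys; the function name and the table and column names are hypothetical.

```python
def constraint_ddl(table: str, column: str, kind: str, test_passed: bool) -> str:
    """Build an ALTER TABLE statement for a primary or unique key constraint."""
    keyword = {"primary_key": "PRIMARY KEY", "unique_key": "UNIQUE"}[kind]
    # RELY tells the optimizer it may trust the constraint; NORELY does not.
    trust = "RELY" if test_passed else "NORELY"
    return f"ALTER TABLE {table} ADD {keyword} ({column}) {trust};"


# A passing test yields a constraint the optimizer may rely on...
print(constraint_ddl("dim_customers", "customer_id", "primary_key", True))
# ...while a failing test still records the constraint, marked NORELY.
print(constraint_ddl("fct_orders", "order_id", "unique_key", False))
```

The point of tying the flag to the test result is that the optimizer only trusts constraints that the data has actually been checked against.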

dbt_semantic_view (Snowflake Labs)

dbt_semantic_view is a newer Snowflake Labs package that adds a semantic_view materialization to dbt. It lets you define and version-control Snowflake's native semantic views the same way you manage models. 

Add it to packages.yml:

packages: 
  - package: Snowflake-Labs/dbt_semantic_view 
    version: [">=1.0.0", "<2.0.0"] 

A semantic view model looks like this:

{{ config(materialized='semantic_view') }} 
 
TABLES ( 
  orders AS {{ ref('fct_orders') }}, 
  customers AS {{ ref('dim_customers') }} 
) 
 
RELATIONSHIPS ( 
  orders_to_customers AS orders (customer_id) REFERENCES customers (customer_id) 
) 
 
DIMENSIONS ( 
  customers.region AS region, 
  orders.order_date AS order_date 
) 
 
METRICS ( 
  orders.total_revenue AS SUM(orders.amount), 
  orders.order_count AS COUNT(orders.order_id) 
) 

Once materialized, the semantic view is a real Snowflake object. It can be consumed by Cortex Analyst, Snowflake Intelligence, and any tool that queries Snowflake. Because the definition lives in your dbt project, metric logic gets the same Git history, peer review, and CI/CD as your transformations. 

This matters more than it sounds. Most semantic layers either live outside dbt (drift inevitable) or get reinvented in every BI tool (drift guaranteed). Defining the semantic layer in dbt and materializing it natively in Snowflake closes that gap. 

SQLFluff

SQLFluff is the de facto SQL linter for dbt. It enforces formatting and style rules across your project so reviewers can focus on logic, not whether someone used trailing commas or capitalized SQL keywords. 

Install it alongside dbt: 

pip install sqlfluff sqlfluff-templater-dbt 

The sqlfluff-templater-dbt plugin lets SQLFluff understand Jinja, refs, sources, and macros. Without it, the linter chokes on dbt syntax. Configure rules in a .sqlfluff file at the project root, and add a dbt_project.yml reference so the templater can find your project. 
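A minimal starting config can be quite small. The Python sketch below renders one with the standard library's configparser, since .sqlfluff uses an INI-style layout. It is an assumed baseline, not a recommended rule set: it selects the dbt templater and the Snowflake dialect, and project_dir assumes .sqlfluff sits next to dbt_project.yml.

```python
import configparser
import io


def sqlfluff_config(project_dir: str = "./") -> str:
    """Render a minimal .sqlfluff file for a dbt project on Snowflake."""
    config = configparser.ConfigParser()
    config["sqlfluff"] = {"templater": "dbt", "dialect": "snowflake"}
    # Tells the dbt templater where dbt_project.yml lives.
    config["sqlfluff:templater:dbt"] = {"project_dir": project_dir}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


print(sqlfluff_config())
```

Save the printed text as .sqlfluff at the project root, then layer rule-specific sections on top as the team agrees on conventions.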

Datacoves sponsors SQLFluff as part of its commitment to open-source dbt tooling. 

dbt-checkpoint (Datacoves)

dbt-checkpoint is a set of pre-commit hooks that validate dbt project quality before code is merged. It catches the things code review usually misses: a model without a description, a column that's documented in YAML but missing from the SQL, a source that's been added without tests. 

Install it as part of your pre-commit setup: 

pip install pre-commit 

Then add the dbt-checkpoint hooks to .pre-commit-config.yaml:

repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v2.0.7 # Verify the latest released version of dbt-checkpoint
    hooks:
      - id: check-model-has-description
      - id: check-model-columns-have-desc
      - id: check-model-has-tests
      - id: check-source-has-freshness
      - id: check-script-has-no-table-name
  	- id: check-script-has-no-table-name 

Run pre-commit install once and the hooks fire automatically on every commit. 

The point isn't to enforce every possible rule. It's to keep technical debt from accumulating before it has a chance to compound. Datacoves maintains dbt-checkpoint as part of the broader dbt ecosystem. 

For a broader look at testing strategy, see An Overview of Testing Options for dbt.

VS Code Extensions That Make dbt on Snowflake Easier

VS Code is the default IDE for dbt development. A few extensions turn it from "a nice editor" into a productive dbt workspace. 

Snowflake VS Code Extension

The official Snowflake extension brings the Snowsight experience into VS Code. You can browse databases, run worksheets, view query results, and upload or download files from Snowflake stages, all without leaving the editor. 

For dbt developers, the most useful part is being able to run ad-hoc queries against your warehouse next to the model you're working on. No more flipping between the browser and your IDE every time you need to inspect a column or check a row count. 

Power User for dbt (formerly dbt Power User)

Power User for dbt (formerly called dbt Power User) is the most useful dbt extension. It adds the things dbt should arguably ship with itself: 

  • Run a model, test, or full DAG with a click instead of typing the command. 
  • Preview the result of a model or any selected CTE inline. (contributed by Datacoves) 
  • Click through ref() and source() calls to jump to the underlying file. 
  • See the compiled SQL side-by-side with the Jinja source. (contributed by Datacoves) 
  • Visualize the lineage graph from a model. 

If you only install one extension, install this one. 

SQLFluff Extension

The SQLFluff VS Code extension wires the SQLFluff linter directly into the editor. Linting errors show up inline as you type, with hover descriptions that link to the SQLFluff docs. 

This is the difference between linting being a chore developers run occasionally and linting being something they fix as they write. The former gets ignored. The latter keeps the codebase clean. 

The extension reads from the same .sqlfluff config file that the CLI uses, so there's no duplicate setup. 

Bringing AI Into Your dbt on Snowflake Workflow 

A modern dbt-on-Snowflake AI workflow combines an in-IDE assistant (Power User for dbt, GitHub Copilot, Claude Code) with a Snowflake-native assistant (Snowflake Cortex CLI) and MCP servers that give the AI structured access to your dbt project and warehouse metadata. 

AI has moved past being a novelty in dbt development. Used well, it accelerates the work that doesn't need a human (writing tests, generating documentation, drafting models, explaining errors) and gives developers more time for the work that does (modeling decisions, business logic, architecture). 

A modern dbt-on-Snowflake workflow has a few good options. 

Snowflake Cortex CLI (CoCo). Snowflake's command-line AI assistant runs against your Snowflake account and works like Claude Code or other terminal-based coding assistants. It's particularly useful for dbt because it can find tables and columns, inspect schemas, and generate SQL grounded in your actual warehouse, not a generic LLM guess. 

Read more: Datacoves Expands Snowflake AI Data Cloud Support

Claude Code, GitHub Copilot, OpenAI Codex CLI, Gemini CLI. Each of these works inside VS Code or the terminal. Claude Code and Codex CLI are particularly strong for multi-step refactors across a dbt project. Copilot is hard to beat for inline suggestions. The right choice depends on what your organization already pays for and what data your security team is comfortable sending to which provider. 

MCP servers. Model Context Protocol servers let AI assistants interact with dbt projects, Snowflake, and other tools through a standardized interface. Snowflake and the broader community have shipped MCP servers. Pairing an MCP server with an AI assistant gives the model real awareness of warehouse metadata. 

The thing to avoid is treating AI as a separate workflow. The point is to integrate it into the same VS Code environment where developers already work, with credentials and access already configured. Asking developers to copy-paste between a chat window and their IDE is friction the team will route around within a week. 

This is one of the harder parts of running dbt on Snowflake at scale: keeping AI tooling consistent across developers, with the right credentials, the right MCP servers, and the right governance around what data the AI can see. Datacoves comes preconfigured with Claude Code, Snowflake Cortex CLI, GitHub Copilot, OpenAI Codex CLI, and Gemini CLI inside the in-browser VS Code environment, all working against your Snowflake account with no per-developer setup. For teams that want to standardize how AI shows up in dbt development, that's a meaningful head start. 

Managing Snowflake Infrastructure Alongside dbt

dbt manages objects inside Snowflake (tables, views, tests, documentation). It does not manage Snowflake itself. Roles, users, grants, warehouses, masking policies, and resource monitors live outside dbt's scope and need a separate infrastructure-as-code tool. 

That list extends to row access policies, network policies, and the databases themselves. Most teams handle all of it with whatever combination of click-ops, Snowsight, and SQL scripts has accumulated over the years. That works until it doesn't.

The point at which it stops working is usually predictable: 

  • A new analyst joins and needs the right access. Nobody can fully reconstruct what the previous analyst was granted. 
  • A masking policy needs to be applied consistently across thirty tables containing PII. Someone misses three of them. 
  • An audit asks who has OWNERSHIP on production schemas. The answer takes a week to assemble. 
  • A new environment (dev, QA, staging) needs to mirror production. The clone drifts within a sprint because grants are applied manually. 

The fix is to manage Snowflake infrastructure as code, the same way you manage dbt models. Define roles, grants, warehouses, and policies in version-controlled files. Apply changes through pull requests. Let CI/CD enforce that production matches what's in Git. 
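One payoff of keeping grants in version-controlled files is that audit questions turn into simple lookups. The Python sketch below answers "who has OWNERSHIP on which schemas" from grant declarations held as data. The GRANTS list, its field names, and the role and schema names are hypothetical stand-ins for whatever your IaC tool stores in Git.

```python
from collections import defaultdict

# Hypothetical grant declarations, as they might be loaded from files in Git.
GRANTS = [
    {"privilege": "OWNERSHIP", "on": "schema ANALYTICS.MARTS", "to_role": "TRANSFORMER"},
    {"privilege": "USAGE", "on": "schema ANALYTICS.MARTS", "to_role": "REPORTER"},
    {"privilege": "OWNERSHIP", "on": "schema ANALYTICS.STAGING", "to_role": "TRANSFORMER"},
    {"privilege": "USAGE", "on": "warehouse TRANSFORMING", "to_role": "TRANSFORMER"},
]


def roles_with(privilege: str, grants: list) -> dict:
    """Group the objects each role holds the given privilege on."""
    found = defaultdict(list)
    for grant in grants:
        if grant["privilege"] == privilege:
            found[grant["to_role"]].append(grant["on"])
    return dict(found)


for role, objects in roles_with("OWNERSHIP", GRANTS).items():
    print(role, objects)
```

The answer is only as trustworthy as the guarantee that production matches Git, which is why the CI/CD enforcement step matters.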

Why Terraform isn't a great fit for Snowflake

Terraform is the obvious starting point, but it's the wrong tool for most Snowflake teams. Terraform was built for managing infrastructure across many cloud providers, with a state file as its source of truth. For Snowflake specifically, this creates real problems: 

  • The state file becomes a sync target instead of a record of intent. Drift between Snowflake and the state file happens often, and resolving it is painful. 
  • The Terraform DSL is unfamiliar territory for analytics engineers. Most data teams don't have full-time platform engineers who already speak Terraform. 
  • Snowflake-specific features (RBAC at scale, tag-based masking policies, row access policies) require contortions in Terraform that a Snowflake-native tool can express directly. 

Snowcap: Snowflake-native infrastructure as code

Snowcap is the Snowflake-native IaC tool Datacoves built and maintains as open source. It manages users, roles, grants, warehouses, masking policies, row access policies, and over 60 other Snowflake resource types using YAML or Python configuration. No state file. No DSL to learn. No abstraction layer between your config and Snowflake. 

Snowcap is opinionated where opinion matters most: 

  • RBAC at scale. Define role hierarchies and grants in YAML. Apply consistently across teams, projects, and environments. 
  • Tag-based masking policies. Tag a column once, apply a masking policy to every column with that tag automatically. 
  • Row access policies. Define them once, version them in Git, deploy them like any other Snowflake object. 
  • CI/CD-first. Every change is a pull request. Production state matches what's been merged. 

If dbt is the workshop where you build data products, Snowcap is the set of power tools that keeps the workshop itself in good order. The two work side by side: Snowcap manages who can see what and where compute lives, dbt manages how the data gets transformed.

For teams already running dbt with Snowflake, adding Snowcap is one of the highest-leverage moves available. It doesn't replace anything you have. It fills the gap that almost every dbt team has but pretends not to: governed, version-controlled, repeatable Snowflake infrastructure. 

When to Stop DIY and Move to Managed dbt

The setup in this guide works. Plenty of teams run it successfully. The honest question isn't whether you can do it yourself. It's whether you should, given what your team is trying to accomplish. 

Here's the pattern most data teams follow: 

At one or two developers, DIY is the right call. The setup is straightforward, the maintenance is low, and the team can iterate on conventions as they go. There's no good reason to add a managed platform at this stage. 

At three to five developers, the cracks start to show. Onboarding a new developer takes a week instead of a day because everyone's local environment is slightly different. Python versions drift. Someone's profiles.yml has a passphrase from 2024 that nobody can find. CI/CD is held together by a YAML file one engineer maintains. It still works, but real time is being lost to platform maintenance. 

At ten or more developers, DIY is expensive. Onboarding tax compounds. Upgrades require coordinating across the whole team. Secrets management becomes a real problem. Multiple dbt projects need governed dependencies. Production runs need an actual orchestrator, not a cron job. CI/CD pipelines need ownership. Someone is now spending a meaningful chunk of their week on platform work that has nothing to do with delivering data products. 

Team Size | DIY Verdict | What Goes Wrong First
1-2 developers | DIY is the right call | Nothing. Setup is straightforward, maintenance is low.
3-5 developers | DIY starts costing real time | Onboarding tax, drifting Python versions, fragile CI/CD. Related: CI/CD With dbt Slim CI.
10+ developers | DIY is expensive | Platform maintenance becomes someone's part-time job.
Regulated industries (any size) | Managed platform with private cloud is usually required | SaaS dbt platforms fail security review. DIY adds months of platform engineering.

For regulated industries, DIY runs into a different wall. Pharma, healthcare, financial services, and government workloads usually require private cloud deployment, strict identity controls, audit logging, and architectures that pass internal security review. SaaS dbt platforms are often a non-starter. DIY on Kubernetes is doable, but it pulls in months of platform engineering work before the data team writes a single model. 

The decision isn't really between "DIY" and "managed." It's between who builds and maintains the platform layer. Either your team does it, or someone else does. If platform engineering is your team's competitive advantage, build it yourself. If your team's competitive advantage is delivering data products, the platform layer is overhead. 

See also: dbt Deployment Options

What managed dbt solves

Managed dbt platforms (the category, not the marketing) handle the layer between dbt and the rest of your infrastructure. The good ones cover: 

  • A consistent in-browser or pre-configured VS Code environment, so every developer is on the same Python version, the same dbt version, and the same set of extensions from day one. 
  • Managed Airflow for orchestration, both a personal sandbox for development and a shared production environment. 
  • Pre-built CI/CD pipelines for dbt tests, SQL linting, governance checks, documentation, and deployment. 
  • Secrets management integrated with your existing vault. 
  • Private cloud deployment for teams that need data to stay inside their own network. 
  • Best-practice templates so new projects start with the right structure instead of inventing it. 

Datacoves is the managed dbt platform we build, and the Snowflake integration is one of our most common deployments. Teams running dbt on Snowflake get an end-to-end environment in their own cloud: managed dbt, managed Airflow, in-browser VS Code, CI/CD, governance, and AI tooling, all preconfigured and connected to their Snowflake account. 

For a side-by-side look at the trade-offs, see our comparison of dbt Core vs dbt Cloud.

Final Thoughts

dbt and Snowflake is one of the most productive combinations in modern data engineering. The tools fit together, the community is active, and the path from "first model" to "production analytics" is well-trodden. That doesn't mean the path is short. 

The setup itself isn't the hard part. Installing the adapter, configuring authentication, writing profiles.yml, and running dbt debug add up to a one-afternoon exercise. The harder part is everything that comes after: keeping ten developers on the same Python version, governing who can do what in Snowflake, integrating AI without creating a mess, deciding which packages are worth their weight, and making the whole thing maintainable as the team grows.

The tooling in this guide handles most of it. dbt-coves removes the boilerplate. dbt_constraints turns your tests into actual database constraints. dbt_semantic_view brings the semantic layer into your dbt project. SQLFluff and dbt-checkpoint keep code quality from drifting. Power User for dbt makes daily development faster. Snowcap fills the gap dbt was never meant to fill. 

Where it gets expensive is at scale. The setup that works for two developers doesn't scale to twenty without serious investment in the platform layer underneath. Either your team builds and maintains that layer, or you find a managed platform that does it for you. There's no third option that holds up over time. 

If you're running dbt on Snowflake today and the setup is starting to feel heavier than it should, book a free architecture review. We'll discuss your environment, show you where Datacoves fits, and tell you honestly whether it makes sense for where you are. 

Thumbnail data operating model
5 mins read

A Data Operating Model is the set of decisions that define how a company delivers value from data. It covers ownership, team topology, workflows, standards, SLAs, governance, and the platform layer underneath all of it. The tools sit inside the operating model, not above it. 

Most enterprises invest heavily in the tool layer and leave the operating model to emerge on its own during the build. That's the pattern behind nearly every frustrated data leader I talk to: the warehouse works, the transformation tool runs, the SI delivered on the statement of work, and the business still isn't getting what it expected. The absence of a defined operating model before the build started is the usual cause. 

This article explains what a Data Operating Model is, what it includes, why foundational gaps compound instead of resolving themselves, and what to do if you're already mid-build and seeing the symptoms. 

What a Data Operating Model Actually Is 

A Data Operating Model is the blueprint for how your organization turns data into business value. It defines who owns what, how work moves through the system, what standards apply, what "good" looks like, and what the platform underneath must enforce. It sits above the tools and above the architecture. The tools exist to serve the operating model, not the other way around. 

Most executives have never been shown a Data Operating Model in concrete terms, so the concept stays abstract. It shouldn't. An operating model is a finite set of decisions that can be written down, agreed on, and enforced. The reason most enterprises don't have one isn't that it's hard to build. It's that it wasn't scoped and nobody owned the outcome. 

Two layers of enterprise data platform

The components of a Data Operating Model  

A mature Data Operating Model answers seven questions: who owns what, how teams are structured, how work moves through the system, what standards apply, what SLAs the business expects, how governance is enforced, and what the platform layer underneath has to automate. 

1. Ownership. Who owns each data product? Who owns each source? Who owns the model that joins them? When something breaks, who is accountable? When something needs to change, whose approval is required? Ownership isn't an org chart. It's the map of accountability across every data asset in the business. 

2. Team topology. How do data teams align to the business? Do you have a central data team that services everyone, embedded analytics engineers inside each domain, or a hybrid mesh model? Which decisions are centralized and which are distributed? Team topology is the hardest component to change later, which is why it should be the first decision made. 

3. Workflows. How does a request become a data product? How does a code change get from a developer's laptop to production? How do business users request a new metric? How do downstream teams get access to upstream data? These workflows should be documented, repeatable, and the same across every team. When every team invents their own, you get the naming drift, the cross-team gaps, and the late-surfacing issues that frustrate the business. 

4. Standards. Naming conventions. Layering semantics. Documentation expectations. Testing requirements. Code review rules. Branching strategy. These are the things that make a platform legible to a new engineer on day one instead of week six. Standards that live only in a Confluence page are not standards. They're suggestions. 

5. SLAs. What does the business expect for data freshness? How fast should a new KPI ship? How fast should a new source onboard? What's the acceptable recovery time when a pipeline fails? Without explicit SLAs, every request becomes a negotiation, and every failure becomes a fire drill. 

6. Governance. Who can approve a production deployment? Who signs off on a new data product? How is access granted and reviewed? How are sensitive fields handled? Governance isn't a separate project to start next quarter. It's a dimension of every decision the operating model makes. 

7. Platform layer. The infrastructure underneath all of the above. Git workflows. CI/CD. Orchestration. Development environments. Secrets management. Deployment conventions. This layer exists to enforce the operating model automatically, so the team doesn't have to remember to follow the rules. 

Every enterprise already has answers to these seven questions. The difference between a mature operating model and an immature one is whether those answers were decided deliberately, written down, and enforced by the system, or whether they emerged ad hoc as the build progressed. 

How it differs from a data strategy, a platform, or an architecture  

These terms get used interchangeably in executive conversations and they shouldn't be. 

Term | What It Is | What It Answers
Data Strategy | The outcomes you're trying to achieve with data | Why
Data Architecture | The technical design of how data flows through your systems | How, at the system level
Data Platform | The collection of tools your teams use to build and run the architecture | What
Data Operating Model | The set of decisions that determine whether the strategy, architecture, and platform produce the outcomes the business expected | How, at the organizational level

Why Most Enterprises Build Before the Operating Model Exists 

If the operating model is this important, why don't enterprises start there? Because the path to a data platform almost never runs through the operating model. It runs through a tool purchase, a vendor pitch, or a business crisis that demands a fast answer. The operating model is the thing that gets skipped because nobody in the room knows to ask for it, and the people selling the build aren't incentivized to slow things down. 

Three patterns show up repeatedly. Each one produces the same outcome: a platform that works technically but doesn't deliver on the business intent. 

Three paths one outcome

The warehouse-first trap  

The first pattern starts with the warehouse. Leadership identifies that the current data infrastructure is too slow, too expensive, or too old. Someone comes back from a Snowflake conference. A decision gets made to modernize. The procurement process kicks off. Within a few months, Snowflake is signed and an implementation partner is scoped. 

The scope is the migration. Move data from the legacy system into Snowflake. Replicate the existing transformation logic. Hit the go-live date. That's what the statement of work says, and that's what gets delivered. 

What isn't in the scope: the operating model. Nobody wrote into the contract that the team would emerge from the engagement with agreed-upon naming conventions, a defined ownership map, documented SLAs, or a governance framework. The warehouse goes live on schedule. The operating model questions are still open eighteen months later, because nobody owned them and nobody was paid to answer them. We've covered what gets missed when the implementation is scoped around the warehouse in more depth. 

The SI-led build  

The second pattern hands the build to a systems integrator (SI) and watches them default to what they know. Every SI has a playbook. Some propose a custom metadata-driven framework. Some build their own Python-based orchestration layer. Some fall back on what they've shipped at ten other clients: heavy stored procedure logic, ELT patterns from a previous engagement, or a homegrown configuration system that mirrors whatever the team's senior architect built fifteen years ago. 

The specific build doesn't matter as much as what the SI is focused on. They're focused on delivering the build. They're not focused on the business outcome the build is supposed to produce. We've seen this pattern documented in detail across enterprise implementations.

That distinction is the source of the problem. 

When the engagement is scoped around the framework, the team's energy goes into framework decisions. How should the config tables be structured? What's the deployment mechanism? How do we handle environment promotion? Those are real questions, and they take real effort to answer. What doesn't get asked in the same meetings: Which business units are going to use this, and do they agree on naming? Who owns the data products once they're live? What SLAs is the business expecting? How will cross-team collaboration work when the second and third business units come online? 

The framework ships. The first use cases deliver. The demo goes well. Then the symptoms start. 

The internal team can read the framework but can't extend it without the SI. Framework changes require a new engagement. New capabilities that land in the open-source ecosystem (new Airflow features, new dbt patterns, new CI/CD tooling) don't land in the custom framework unless someone pays the SI to add them. The team is now operating on two clocks: the clock of the open-source world moving forward, and the clock of the custom build moving only when budget is available. 

Meanwhile, the operating model gaps that existed before the SI arrived are still there. The SI wasn't asked to define naming conventions across business units, or to specify how cross-team collaboration should work, or to document who owns what. They were asked to build. So the build got built, the delivery team uses it, and the foundational questions remain unanswered. Now they're harder to address because the system is already in production and the vendor who understands it best is billing by the hour. 

None of this is a critique of SIs as a category. Good SIs exist, and they can deliver real value inside a well-defined operating model. The problem is asking an SI to build a platform before the organization has decided what the platform is supposed to enforce. Under those conditions, the SI will default to what they know how to build. And what they know how to build will calcify around their way of working long after they've rolled off. 

The internal champion's blind spot  

The third pattern doesn't require an SI. It happens when an internal data leader, often passionate and well-intentioned, drives the modernization themselves. They know the business problem. They've seen the pain. They've done their research on the modern data stack. They build the business case and get the budget. 

What they often don't have is deep production experience running a data platform at enterprise scale. They know what outcomes good platforms produce. They haven't necessarily been inside one long enough to see the operating model decisions that make those outcomes possible. 

So the modernization gets shaped around what they know: the warehouse, the transformation tool, maybe a basic orchestration layer. The harder operating model questions (ownership, team topology, SLAs, standards enforcement, cross-team workflows) don't get asked because nobody in the room has been burned by skipping them before. The team inherits a modern tool stack and an immature operating model, and the symptoms start showing up twelve to eighteen months in. 

Buying Snowflake, buying dbt, and hiring an SI does not give you a Data Operating Model. The tools sit inside the operating model, not above it. Starting the build before the operating model is defined produces a platform that works technically but doesn't deliver on the business intent.

The common thread 

All three patterns share the same structural problem. The build starts before the operating model is defined, and the operating model is expected to emerge on its own during delivery. It doesn't. Operating models don't emerge. They get decided, or they get compensated for. 

The teams that end up with mature operating models aren't the ones who got lucky with their tool choices or their SI. They're the ones who treated the operating model as an explicit deliverable, owned by leadership, scoped at the start of the project, and refined over time as the business learned. That work is not glamorous. It doesn't show up in a conference talk. It's the difference between a platform the business trusts and a platform the business works around. 

What It Looks Like When the Operating Model Is Missing 

The symptoms of a missing operating model are concrete, repeatable, and visible without technical expertise. If your platform has any of them, the operating model is doing less work than the team thinks it is. 

Naming drift across business units 

The same concept gets six different names. CUSTOMER_ORDERS_MONTHLY_US, CUSTOMER_ORDERS_US_MONTHLY, CUSTOMER_ORDERS_MONTHLY_US_FINAL, CUSTOMER_ORDERS_US_MTHLY, and two more variations depending on which team built the model. Every variation is defensible in isolation. Together they make the platform illegible to a new engineer, impossible to govern, and fragile to extend. Naming is the most visible tell of a missing operating model because naming is decided by the operating model. When the operating model is absent, naming is decided by whoever gets there first. 

Downstream teams unable to use the data products they need 

A team needs to answer an ad-hoc question using data that exists in the platform but wasn't shaped for their use case. They can't use the curated layer, so they go upstream and query raw tables directly. They build parallel logic. They duplicate transformations. The platform was supposed to be the source of truth. It's now one of three sources, and the business users don't know which one to trust. 

This is a cross-team workflow problem. The operating model was supposed to define how downstream teams extend the platform, how they request new data products, and what process turns an ad-hoc query into a curated asset. It didn't, so each team invented its own answer. 

GenAI exposes what the operating model never enforced 

Wide tables work reasonably well for operational reporting. A business analyst can find their way around a hundred-column table if they know what they're looking for. GenAI can't. Large language models answering business questions need narrow, purpose-built tables with clean column-level documentation, consistent naming, and traceable lineage. None of that comes from the warehouse. All of it comes from the operating model. 

Enterprises that deferred documentation, skipped column-level descriptions, and let naming drift for three years are discovering that their AI initiative is surfacing every gap at once. The foundation they never built is now the thing blocking the board-mandated priority. 

Unmet requirements surface at UAT, not in design 

Requirements that should have been caught in the design phase land in UAT instead. The business user sees the data and says "that's not what I asked for." The team goes back to rework. The go-live date slips. The credibility of the delivery process erodes. Everyone agrees that requirements gathering needs to be better next time. 

Requirements gathering isn't the problem. The problem is that the operating model never defined how business users participate in data product design, who validates the model before build starts, or what the acceptance criteria look like before UAT begins. Without that definition, the feedback loop closes at the wrong end of the project. 

Governance deferred as a future project 

The executive summary lists "governance" as a Q2 initiative, then a Q3 initiative, then a Q1-next-year initiative. It keeps getting pushed because nobody owns it, nobody scoped it, and it doesn't have a clear business sponsor. Meanwhile, the platform is live. Data products are shipping. Access is being granted through manual tickets. Metadata is being maintained by whoever remembers to maintain it. 

Governance deferred is governance that never happens. The operating model defines governance as a dimension of every decision, not a separate project. When it lives in the future, it stays there. 

Metadata, lineage, and column documentation missing by default 

Nobody decided to skip documentation. It just wasn't on the project plan. Column-level descriptions don't exist because writing them wasn't part of anyone's definition of done. Lineage isn't captured because the framework doesn't surface it automatically and no one has time to maintain it manually. Business users asking "where does this number come from?" get an answer from whichever engineer built the model, if that engineer is still on the team. 

Documentation that depends on discipline is documentation that degrades. The operating model is supposed to make documentation a byproduct of the build, not a deferred task. 
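One way to make documentation a byproduct of the build is to fail the build when it's missing. The sketch below is a minimal illustration, not a specific tool's API: the model metadata is a plain list of dictionaries, and both that shape and the function name undocumented_columns are assumptions made for the example.

```python
def undocumented_columns(models):
    """Return sorted 'model.column' strings that lack a non-empty description.

    `models` is a hypothetical metadata structure:
    [{"name": ..., "columns": [{"name": ..., "description": ...}, ...]}, ...]
    """
    missing = []
    for model in models:
        for column in model.get("columns", []):
            # Treat absent, None, and whitespace-only descriptions as missing.
            description = column.get("description") or ""
            if not description.strip():
                missing.append(f"{model['name']}.{column['name']}")
    return sorted(missing)


if __name__ == "__main__":
    example = [
        {
            "name": "orders",
            "columns": [
                {"name": "order_id", "description": "Primary key of the order"},
                {"name": "amt", "description": "   "},
                {"name": "region"},
            ],
        }
    ]
    gaps = undocumented_columns(example)
    # A CI step would exit non-zero here instead of printing.
    print(gaps)
```

If a check like this runs on every pull request, "definition of done" stops depending on anyone's memory: a column without a description simply doesn't merge.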

Framework changes require the original builder 

The internal team can use the platform but can't extend it. Every new data source, every new transformation pattern, every new capability requires going back to the SI or the original architect. This dependency was never called out explicitly, but it's now the single biggest constraint on the team's ability to move. And it gets more expensive every quarter. 

Access assignment dependent on manual steps 

A new table or view is created. Someone is supposed to configure it so the right roles get access. Sometimes that step gets skipped. When it does, the object exists in the warehouse, but access doesn't propagate. Users can't see the new object. Someone spends a morning figuring out why. The fix is trivial. The pattern repeats next month with a different table. 

The operating model was supposed to decide whether access assignment happens through automation or through a manual checklist. Either answer is defensible. With no answer, access gets assigned when someone remembers to do it and gets missed when they don't. 
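If the answer is automation, the rules have to live somewhere the system can read them. Here is a minimal sketch of rule-based access assignment. The rule table, the schema and role names, and the function grants_for are all hypothetical; the generated statements follow the general GRANT SELECT ... TO ROLE form, and a real setup would need fully qualified object names and the right object type.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: glob pattern over SCHEMA.OBJECT -> roles that get SELECT.
ACCESS_RULES = [
    ("MARTS_FINANCE.*", ["FINANCE_ANALYST"]),
    ("MARTS_*.*", ["DATA_ENGINEER"]),
]


def grants_for(object_name):
    """Return GRANT statements for a new object, derived from ACCESS_RULES.

    Raises ValueError when no rule matches, so an object without an access
    decision is surfaced immediately instead of being silently invisible.
    """
    roles = []
    for pattern, rule_roles in ACCESS_RULES:
        if fnmatchcase(object_name, pattern):
            for role in rule_roles:
                if role not in roles:
                    roles.append(role)
    if not roles:
        raise ValueError(f"no access rule matches {object_name}")
    return [
        f"GRANT SELECT ON TABLE {object_name} TO ROLE {role};" for role in roles
    ]


if __name__ == "__main__":
    for statement in grants_for("MARTS_FINANCE.REVENUE"):
        print(statement)
```

The design choice that matters is the error path: an object that matches no rule fails loudly at creation time, instead of costing someone a morning next month.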

Lower environments drift from production 

DEV is refreshed on an ad-hoc cadence. PRE-PROD is "closer" to PROD but still out of sync. A change passes testing, hits production, and behaves differently because the data shape in production isn't what the team tested against. The business finds out. Trust erodes. 

Environment parity is an operating model decision. Without one, every team defaults to "good enough for today" and the divergence between environments becomes structural. 

Dependencies managed by convention, not by the system 

Pipeline dependencies live in configuration files that developers update as they remember. If an upstream dependency is missing from the config, the data quality checks are the last safety net. When DQ coverage has gaps, the pipeline runs on incomplete data and nobody notices until a downstream user raises a ticket. 

The operating model should have decided whether dependencies are inferred from the code or declared in configuration, and whether a missed declaration fails the build or silently succeeds. Without that decision, the default is "silently succeeds," which is the failure mode nobody wants and everybody ends up with. 
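To illustrate the "inferred from the code" option, here is a minimal sketch that assumes models reference their upstreams with dbt-style {{ ref('...') }} calls. The regular expression is a simplification (it only handles single-argument ref() with a quoted name and does not skip SQL comments), and the function names are hypothetical.

```python
import re

# Simplified matcher for {{ ref('model_name') }} or {{ ref("model_name") }}.
REF_CALL = re.compile(r"""\{\{\s*ref\(\s*['"]([^'"]+)['"]\s*\)\s*\}\}""")


def inferred_dependencies(sql):
    """Return the set of upstream model names referenced in the SQL text."""
    return set(REF_CALL.findall(sql))


def undeclared_dependencies(sql, declared):
    """Return upstream models used in the SQL but absent from the declared list."""
    return sorted(inferred_dependencies(sql) - set(declared))


if __name__ == "__main__":
    model_sql = """
        select o.order_id, c.customer_name
        from {{ ref('stg_orders') }} o
        join {{ ref("stg_customers") }} c on o.customer_id = c.customer_id
    """
    missing = undeclared_dependencies(model_sql, declared=["stg_orders"])
    # A build step would fail here if `missing` is non-empty.
    print(missing)
```

Whether the system infers dependencies or only checks the declared list against the code, the point is the same: a missed declaration fails the build instead of silently succeeding.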

If three or more of these are familiar, the root cause is a missing operating model. The symptoms are the system telling you so. 

Checklist Control vs Platform Enforcement 

The single most useful frame for diagnosing a data platform is whether the controls that matter are enforced by the system or by people remembering to follow a process. This distinction cuts through every conversation about tools, frameworks, and team maturity. It's also the fastest way to predict how a platform will behave under growth, turnover, and pressure. 

Most enterprise data platforms are checklist-controlled and presented as if they were platform-enforced. The gap between the two is where the symptoms in the previous section come from. 


Checklist control 

A checklist-controlled platform depends on people doing the right thing every time. Naming conventions live in a document that gets read once during onboarding. Access assignment requires someone to update a configuration table after creating a new object. Code quality depends on the reviewer having a good day. Dependencies get declared when the developer remembers to declare them. Documentation happens when there's time. 

This works when the team is small, experienced, and under no time pressure. It degrades the moment any of those three conditions change. A new hire inherits the SOPs but not the instincts. A team lead rolls off and takes the context with them. A deadline compresses and the first thing that gets skipped is whatever depends on discipline rather than on the build itself. 

Every failure in a checklist-controlled platform produces the same diagnosis: someone didn't follow the process. Which is accurate, and beside the point. The real diagnosis is that the platform was designed to require people to follow a process in a place where the system could have enforced it automatically. 

Platform enforcement 

A platform-enforced system makes the wrong action difficult, obvious, or impossible. Naming conventions are validated by CI/CD before a pull request can merge. Access is granted by the system based on rules, not by someone updating a table after the fact. Code quality is enforced by automated linting, testing, and review requirements that run on every commit. Dependencies are inferred from the code and validated against the actual pipeline. Documentation is required for a model to build, not requested after the fact. 

The team doesn't have to remember the rules. The rules are the system. 

This is the difference between a platform that scales and one that doesn't. A platform that depends on discipline gets more fragile as the team grows. A platform that enforces the rules gets stronger as the team grows, because every new engineer inherits the guardrails on day one without reading a document or asking anyone how things work. 
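To make "validated by CI/CD" concrete, here is a minimal sketch of a naming check that could run before a pull request merges. The convention itself (ENTITY_GRAIN_REGION in uppercase, with fixed lists of grains and regions) is an assumption for illustration, as is the function name naming_violations; your operating model would supply the real rules.

```python
import re

# Hypothetical convention: <ENTITY>_<GRAIN>_<REGION>, e.g. CUSTOMER_ORDERS_MONTHLY_US.
ALLOWED_GRAINS = {"DAILY", "WEEKLY", "MONTHLY"}
ALLOWED_REGIONS = {"US", "EMEA", "APAC"}
NAME_PATTERN = re.compile(
    r"^(?P<entity>[A-Z]+(?:_[A-Z]+)*)_(?P<grain>[A-Z]+)_(?P<region>[A-Z]+)$"
)


def naming_violations(model_names):
    """Return {name: reason} for every name that breaks the convention."""
    violations = {}
    for name in model_names:
        match = NAME_PATTERN.match(name)
        if match is None:
            violations[name] = "does not follow ENTITY_GRAIN_REGION in uppercase"
        elif match["grain"] not in ALLOWED_GRAINS:
            violations[name] = f"unknown grain '{match['grain']}'"
        elif match["region"] not in ALLOWED_REGIONS:
            violations[name] = f"unknown region '{match['region']}'"
    return violations


if __name__ == "__main__":
    candidates = [
        "CUSTOMER_ORDERS_MONTHLY_US",
        "CUSTOMER_ORDERS_US_MONTHLY",
        "CUSTOMER_ORDERS_MONTHLY_US_FINAL",
        "CUSTOMER_ORDERS_US_MTHLY",
    ]
    for name, reason in naming_violations(candidates).items():
        print(f"{name}: {reason}")
```

Run against the variants from the naming drift symptom, only CUSTOMER_ORDERS_MONTHLY_US passes. A CI job would fail the build whenever the returned dictionary is non-empty, which is what turns the convention from a suggestion into a rule.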

The comparison 

The comparison that matters spans five dimensions: 

Who enforces the control. Checklist platforms rely on people. Enforced platforms rely on the system. People get tired, leave, and forget. Systems don't. 

Category | Checklist Control | Platform Enforcement
Who enforces control | People, following SOPs | The system itself
Scaling with team size | Degrades. Quality depends on onboarding thoroughness. | Scales cleanly. New members inherit guardrails on day one.
Behavior under pressure | Discipline is the first thing cut when deadlines compress | Rules apply regardless of deadline pressure
Where defects surface | Downstream, often found by a business user | Blocked before they reach the next stage
Audit and compliance | Depends on documentation discipline; drift goes undetected | Audit trail is automatic, generated by the system

Why this frame is useful to an executive 

The most useful test of a data platform is this: if a person fails to follow the process, does the system stop them, or does the defect propagate? If the defect propagates, the control is a checklist. It may work today. It won't work at scale. 

A data leader reading a platform architecture document usually can't tell whether the platform is checklist-controlled or enforced. The document will describe controls either way. The test is to read every control and ask: "if a person fails to follow this, does the system stop them, or does the defect propagate?" 

Platforms that look mature in a demo and degrade in production are almost always checklist-controlled platforms. The demo is run by the people who wrote the checklist. The production team is everyone else. 

Why These Gaps Compound at Scale 

The assumption behind most enterprise data platforms is that the foundational issues surfacing today are growing pains. They'll get fixed as the team matures, as the next phase of the build lands, as the governance workstream finally kicks off. This assumption is wrong. 

Foundational gaps don't resolve themselves as the platform grows. They compound. Every new hire inherits the SOPs. Every new business unit multiplies the manual steps. The window to fix foundational issues cheaply closes quickly after go-live. 

Four mechanisms make them worse. 

New hires inherit SOPs, not guardrails 

Every new engineer who joins the team inherits whatever controls are in place on their start date. If the controls are platform-enforced, they inherit the guardrails automatically. The system makes the right action easy and the wrong action difficult. Onboarding becomes a matter of learning the business, not learning which of fourteen naming conventions applies to which business unit. 

If the controls are checklist-based, new engineers inherit a document. Or a wiki. Or a Slack message from someone who remembers how things worked six months ago. The quality of their work becomes a function of how thorough their onboarding was and how carefully they read a Confluence page that may or may not be up to date. 

The more engineers you onboard, the more variation accumulates. Naming drift gets worse with every new hire. Documentation gaps multiply. Cross-team conventions diverge. The team isn't doing anything wrong. They're just operating in a system that produces drift as its default behavior, and the drift is proportional to team size. 

New business units multiply the manual steps 

A platform serving one business unit can absorb a surprising amount of process debt. The people involved know each other. Context gets shared informally. Workarounds get remembered. 

A platform serving four business units cannot. Every manual step that exists in the operating model (registering a new source, assigning access to a new object, declaring a pipeline dependency, updating a config table, reviewing a model against naming conventions) has to happen four times, by four different teams, under four different sets of pressures. The error rate doesn't stay constant. It grows. 

The platform was built to handle the first business unit. The second business unit stressed it. The third exposed the gaps. By the fourth, the team is spending more time coordinating across business units than building for any of them. None of this was visible in the original design. It becomes visible only at the scale where the gaps matter. 

Vendor dependency deepens over time 

Platforms built around a custom SI-delivered framework, a proprietary metadata layer, or a heavily customized orchestration stack produce a specific kind of debt: the debt of vendor knowledge. The people who built the system understand it. Nobody else does. As time passes, the system gets larger, the edges get more ornate, and the cost of explaining it to a new team gets higher. 

The organization reaches a point where it can't extend the platform without the original builder. Every change requires a new engagement. Every new capability has a price tag attached. The open-source world is shipping new Airflow features, new dbt patterns, new CI/CD tooling, and new governance capabilities, none of which land inside the custom framework unless someone pays for the port. The gap between what's possible and what the team can actually use widens every quarter. 

This is not a problem you can engineer your way out of once you're in it. The only way to solve it is to replace the custom layer with something the internal team can own, which is a second transformation program on top of the first. 

Audit and compliance drift becomes undetectable 

Manual processes produce manual records. Agile board tickets. IT change logs. A spreadsheet that tracks who has access to what. Each of these is updated by a person, which means each of them can drift from the actual state of the system without anyone noticing. 

In a small, well-disciplined team, the drift is minor. At enterprise scale, it's structural. Documented controls say one thing. The system is configured another way. Nobody notices until a compliance review surfaces the discrepancy, or until an incident makes it obvious that the access model on paper doesn't match the access model in production. 

The teams that avoid this don't have better discipline. They have infrastructure as code, automated audit trails, and platform-enforced access management. The audit log is a byproduct of the system itself, not a ledger someone has to maintain. 
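The underlying check is simple once both sides are machine-readable. As a minimal sketch, assume the documented access model and the grants actually present in the warehouse can each be exported as (role, object) pairs. How you obtain those pairs is outside this example, and the function name access_drift is hypothetical.

```python
def access_drift(declared, actual):
    """Compare documented grants with grants found in the system.

    Both arguments are iterables of (role, object_name) pairs.
    Returns grants that are documented but absent ("missing") and grants
    that exist in the system but were never documented ("unexpected").
    """
    declared_set, actual_set = set(declared), set(actual)
    return {
        "missing": sorted(declared_set - actual_set),
        "unexpected": sorted(actual_set - declared_set),
    }


if __name__ == "__main__":
    documented = [
        ("FINANCE_ANALYST", "MARTS_FINANCE.REVENUE"),
        ("DATA_ENGINEER", "MARTS_FINANCE.REVENUE"),
    ]
    in_production = [
        ("DATA_ENGINEER", "MARTS_FINANCE.REVENUE"),
        ("INTERN", "MARTS_FINANCE.REVENUE"),
    ]
    print(access_drift(documented, in_production))
```

Run on a schedule, a comparison like this surfaces the gap between paper and production long before a compliance review does.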

The time window is shorter than leaders think 

The most dangerous assumption about foundational gaps is that they can be addressed later, once the delivery pressure eases. Delivery pressure never eases. The backlog grows. The business adds new use cases. The board adds an AI mandate. The team that was going to refactor the foundation in Q3 is now fighting fires through Q4. 

Meanwhile, every new data product built on top of the existing foundation inherits the same gaps. Refactoring gets more expensive every month, not less. The window to fix foundational issues cheaply closes quickly after go-live. After that, every fix is a migration, and every migration competes with the delivery work the business is asking for. 

The teams that treat operating model gaps as technical debt to be addressed later are making a bet about time that almost never pays off. The teams that treat operating model gaps as blockers to be addressed now are the ones that come out of the next three years with a platform the business trusts. 

Why GenAI Makes the Operating Model Non-Optional 

Every CEO has a GenAI mandate. Every board is asking about it. And yet a July 2025 MIT NANDA study found that 95% of enterprise GenAI pilots delivered no measurable P&L impact, despite $30–40 billion in enterprise spending. The default assumption behind those investments was that the data foundation was ready. It almost never is. 

GenAI is the forcing function that makes operating model gaps impossible to hide. Wide tables, missing column descriptions, undocumented lineage, and manual access management all break AI workloads before they break human users. 

Wide tables break LLMs faster than they break humans 

A business analyst can work with a hundred-column table. They know what they're looking for, they skip the columns that don't matter, and they ignore the fields with unclear definitions. A large language model can't. When an LLM is given a wide table with inconsistent naming and missing column descriptions, it hallucinates. It picks the column that sounds right. It joins on a field that looks like a key and isn't. The output is confident and wrong. 

The fix is narrow, purpose-built tables with clean semantics. Column names that describe what they contain. Column descriptions that explain business meaning. Consistent naming across related tables. Clear primary and foreign key relationships. These aren't data engineering niceties. They're the minimum viable inputs for AI that produces trustworthy answers. 

Enterprises that spent three years building wide, denormalized operational reporting tables are now discovering that those tables can't be pointed at GenAI directly. They need a second modeling layer, often called a semantic layer, built for AI consumption. That layer takes real work to build. It's a project nobody scoped, running parallel to the existing delivery pressure. 

Column-level documentation is suddenly the critical path 

For years, column descriptions were a nice-to-have. Data catalogs had them when a team made the effort. Documentation quality varied by business unit, by team lead, by quarter. The business mostly worked around the gaps. 

GenAI changes that math. An LLM answering a business question needs to know what every column means. If the column descriptions are missing, stale, or wrong, the model fills in the gaps with plausible-sounding guesses. The answers come back polished and authoritative. The errors are invisible until a business user acts on a wrong number. 

The operating model was supposed to decide that column descriptions are a requirement, not an afterthought. Most operating models didn't. So now the team is writing three years of back-documentation under board pressure, on top of the existing delivery work, for data products that have been live for months. 

Lineage becomes a trust requirement 

When a business user asks an LLM "why is our Q3 revenue in the Northeast region down?", the LLM's answer is only as trustworthy as the lineage of the data it's querying. Where did the number come from? What source fed it? What transformations were applied? Which version of the transformation logic was in effect when the number was computed? 

Platforms without end-to-end lineage can't answer those questions. The business user doesn't know what to trust. The data team can't validate the AI's output. The GenAI initiative produces answers that are confidently wrong, fails an executive review, and gets shelved. 

Lineage is an operating model decision. Platforms that made the decision to capture lineage automatically as part of the build have it. Platforms that deferred lineage to a future governance project don't. And the second category is scrambling. 

Gartner predicts that through 2026, organizations will abandon 60% of AI projects unsupported by AI-ready data. In addition, 63% of organizations either don't have or are unsure about having the right data management practices for AI. This is not a theory. It's already arriving on board agendas. 

Access models break under AI workloads 

Operational reporting has predictable access patterns. A business analyst queries the tables they've been granted access to. A dashboard uses a service account with a defined permission scope. Everyone knows what's authorized and what isn't. 

GenAI workloads don't behave that way. An LLM with access to "the sales data" may try to answer a question by joining across tables that sit in different access tiers. Natural language queries don't respect the access boundaries that were designed for structured SQL. Platforms with manual access assignment and checklist-controlled permissions produce one of two outcomes: AI that can't answer the question because it can't access the data, or AI that answers the question by accessing data it shouldn't have seen. 

Both outcomes are failures. The fix is access management that's granular, automated, and enforced by the system. The operating model was supposed to define that. If it didn't, the GenAI initiative is about to expose exactly which data is governed and which data is governed by accident. 
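As a rough illustration of "enforced by the system," the sketch below checks the tables a generated query wants to touch against the tiers a caller is allowed to read, before the query runs. The tier names, table names, and function are all hypothetical; a real platform would read grants or tags from the warehouse rather than a hard-coded dictionary.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical tier assignments, for illustration only.
TABLE_TIERS: Dict[str, str] = {
    "sales.orders": "internal",
    "sales.customers_pii": "restricted",
    "finance.payroll": "confidential",
}


def unauthorized_tables(
    tables: Iterable[str],
    allowed_tiers: Set[str],
    table_tiers: Dict[str, str] = TABLE_TIERS,
) -> List[str]:
    """Return the tables the caller may not read.

    Tables with no known tier are denied by default.
    """
    return sorted(t for t in tables if table_tiers.get(t) not in allowed_tiers)


# An AI agent scoped to "internal" data tries to join across tiers.
blocked = unauthorized_tables(["sales.orders", "sales.customers_pii"], {"internal"})
if blocked:
    print("refusing query; no access to:", ", ".join(blocked))
```

The deny-by-default choice matters here: an untagged table is treated as off limits, which is the opposite of data that is "governed by accident."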

The compressed timeline 

The operating model problems that felt tolerable in 2023 are intolerable in 2026. The board isn't giving the data team three years to refactor. They're asking for GenAI pilots in six months and production AI in twelve. 

Teams with a defined operating model and a platform that enforces it are shipping those pilots already. They're not scrambling to back-fill documentation, rebuild wide tables into semantic layers, or retrofit access management. The work was done during the build, because the operating model made it part of the build. 

Teams without that foundation are rediscovering every gap under deadline pressure. The AI initiative is failing because the foundation underneath it was never ready, and GenAI is the first workload that refuses to work around the foundation's problems. 

If your data platform has the symptoms described earlier in this article, your GenAI initiative will surface every one of them. On a timeline the business is about to compress. 

What to Do If You're Already Mid-Build 

Most executives reading this article are not at the start of a data platform project. They're twelve, eighteen, twenty-four months in. The warehouse is live. The framework is in production. The first business unit is using it. The symptoms are real, and the question isn't whether the operating model should have been defined earlier. It's what to do now. 

The answer is not to rip everything out. It's also not to accept the current trajectory and hope the next phase of the build compensates for the gaps in the current one. There's a middle path, and it starts with changing what the team is working on, not what it's working with. 

Define the operating model before you write another line of code 

Decisions first. Build second. Even mid-project. 

The operating model is a finite set of decisions. A working session with the right people in the room can get most of the way through the list in a week. What matters is that the decisions get made deliberately and written down, not that they get made perfectly on the first try. 

The decisions that matter most, in order of impact: 

Naming conventions. Pick them. Write them down. Validate them automatically in CI. Every future asset conforms. Existing assets get renamed on a defined schedule. 

Ownership map. Every data product has a named owner. Every source has a named owner. Every shared model has a named owner. If ownership is unclear, that's the first decision to make, not the last. 

Layering semantics. What is raw data? What is a cleaned source? What is a business entity? What is a data product? Four layers, defined crisply, consistent across business units. Not six layers with three teams using them differently. 

Access and environment parity. How is access granted? How is it reviewed? What's the refresh cadence for lower environments? Are DEV and PRE_PROD in sync with PROD, and if not, is that a known and accepted limitation or a problem nobody has prioritized? 

SLAs. What does the business expect? For a new KPI. For a source onboarding. For a production incident. These get documented. Trade-offs get discussed explicitly instead of assumed. 

Cross-team workflows. When the second and third business units onboard, how do they request data products from the central team? How do they extend models the central team owns? How do they avoid duplicating logic that already exists? This is the workflow that scales the platform beyond its first success. 

Governance. Not as a future project. As a dimension of every decision already on this list. Ownership, access, naming, and lineage are all governance. If "governance" is still on the roadmap as a separate workstream, it's already too late. 

The output of this work is a document. Short, explicit, and owned by a named executive. Not a deck. Not a wiki page. A written operating model that the team can point to when decisions come up, and that the platform can enforce. 
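Of the decisions above, naming conventions are the easiest to hand to the system. A minimal sketch of a CI check follows; the layer names and prefixes (stg_, int_, dim_, fct_) are assumptions chosen for illustration, not a convention this article prescribes.

```python
import re
from typing import Dict, List

_SNAKE = r"[a-z0-9]+(_[a-z0-9]+)*"

# Hypothetical convention: a layer-specific prefix followed by snake_case.
LAYER_PATTERNS = {
    "staging": re.compile(rf"^stg_{_SNAKE}$"),
    "intermediate": re.compile(rf"^int_{_SNAKE}$"),
    "marts": re.compile(rf"^(dim|fct)_{_SNAKE}$"),
}


def naming_violations(models: Dict[str, str]) -> List[str]:
    """Check a {model_name: layer} mapping and return readable violations."""
    problems = []
    for name, layer in sorted(models.items()):
        pattern = LAYER_PATTERNS.get(layer)
        if pattern is None:
            problems.append(f"{name}: unknown layer '{layer}'")
        elif not pattern.match(name):
            problems.append(f"{name}: does not match the '{layer}' convention")
    return problems


for problem in naming_violations(
    {"stg_orders": "staging", "Orders_Final": "marts", "fct_daily_revenue": "marts"}
):
    print(problem)
```

Run on every pull request, a check like this turns the written convention into something the platform can enforce.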

Separate the operating model from the infrastructure underneath it 

The team's energy should go into operating model decisions, not rebuilding Git workflows, CI/CD, and orchestration from scratch. 

If the operating model is a finite set of decisions, the infrastructure underneath it is the larger ongoing cost. Git workflows. CI/CD pipelines. Development environments. Secrets management. Orchestration. Deployment standards. Testing frameworks. Every team that builds a serious data platform eventually must build or buy all of it. 

Teams that try to build the operating model and the infrastructure at the same time, with the same people, end up doing neither well. The operating model decisions get rushed because infrastructure is urgent. The infrastructure gets built without operating model clarity because decisions haven't been made yet. Both suffer. 

The teams that succeed separate the two. The operating model is their work. The infrastructure underneath it is either delegated to a platform that's already built or scoped as a distinct workstream with its own ownership. When the team's meeting time is spent on operating model decisions instead of CI/CD configuration, the operating model gets defined faster, and the infrastructure stays consistent with it. 

Ask the diagnostic questions 

The hardest part of acting on a missing operating model is knowing where the gaps are. The executive asking, "is our operating model mature?" is usually not close enough to the platform to answer it. The people close enough to answer are often incentivized to say everything is under control. 

A small set of diagnostic questions surfaces where the operating model is doing work and where it isn't. Answering them honestly takes an hour. The pattern of answers tells you where to focus first. 

On enforcement. Which of your data platform controls are enforced by the system, and which depend on people following a process? If a team member fails to follow the process, does the system stop them, or does the defect reach production? 

On ownership. For every data product in your platform, can you name the owner in under thirty seconds? If not, how many orphans are there, and who inherits them when something breaks? 

On naming and layering. Can a new engineer look at a table name and know what layer it belongs to, which business unit owns it, and what it contains? If not, how much context do they have to ask for before they can do their job? 

On vendor dependency. If the SI or original architect of your platform disengaged tomorrow, could your internal team extend the framework? If not, how much of your roadmap depends on their continued engagement, and what's the cost? 

On governance. Is governance a live dimension of every decision, or is it a future project on a slide deck? If it's a future project, how long has it been there? 

On GenAI readiness. Could your current platform support a GenAI product that a business user would trust with a strategic decision? If not, what specifically is missing, and how long would it take to build? 

On the time window. If you did nothing to change the current trajectory, what does the platform look like in twelve months? If the answer is "worse than today," the operating model work isn't optional. 

How the Platform Layer Enforces the Operating Model 

The operating model is the set of decisions. The platform layer is the system that makes those decisions automatic. Separating the two is how mature data organizations move fast without degrading quality as they scale. 

Datacoves exists because most enterprise data teams are spending their time on the wrong layer. They're rebuilding Git workflows, configuring CI/CD, standing up orchestration, wiring secrets management, and writing deployment conventions from scratch, on top of running the business. That work is necessary. It's also not differentiated. Every enterprise data team needs the same underlying platform capabilities, and every team that builds them in-house takes six to twelve months to get there, plus ongoing maintenance that never ends. 

Datacoves delivers those capabilities preconfigured, inside the customer's private cloud, running on open-source tools the internal team can own. The operating model decisions still belong to the organization. The infrastructure underneath them is already built. 

What the platform enforces out of the box 

Git workflows with branching conventions, pull request requirements, and automated validation on every commit. Naming conventions, testing requirements, and documentation expectations get enforced before code merges. A missed convention doesn't reach production because the system doesn't let it. 

CI/CD pipelines that run dbt tests, SQL linting, governance checks, and deployment validation automatically. Quality becomes a property of the pipeline itself, regardless of how attentive the reviewer is that morning. 

Managed Airflow for orchestration. Pipeline dependencies, retries, failure alerts, and scheduling work consistently across every team. My Airflow for developer testing, Teams Airflow for production. Engineers don't rebuild orchestration conventions for each new project. 

In-browser VS Code environments that come up preconfigured with dbt, Python, SQLFluff, Git integration, and every tool the team needs. A new engineer opens their environment on day one and starts writing code. Onboarding time drops from weeks to hours. 

Secrets management integrated with the customer's existing vault or AWS Secrets Manager. Credentials never live in code. Access is controlled by the system itself. 

Deployment standards that promote code from development through testing to production on the same workflow every time. No manual deployment steps. No scripts that only one person knows how to run. 

Governance enforcement at commit time. dbt-checkpoint catches quality issues before they reach the pipeline. SQLFluff keeps SQL consistent. Naming conventions validate in CI. The team doesn't have to remember the rules because the system enforces them. 

Why this is the platform enforcement model the article has been describing 

Every control listed above is a system-enforced version of a checklist most enterprise platforms maintain manually. The difference in outcomes is structural, not incremental. A platform that enforces these controls automatically produces consistent quality at any team size. A platform that depends on discipline degrades as the team grows. 

Datacoves is built around the assumption that the operating model is the customer's work, and the infrastructure that enforces the operating model should be the platform's work. That separation is what lets the customer's team spend its time on decisions that differentiate the business, not on infrastructure that every data team needs and no data team should have to build. 

What this means for a mid-build team 

For a team already running on Snowflake with a custom framework or an SI-built platform, Datacoves is the alternative to a second transformation program. Instead of rebuilding the infrastructure layer internally or paying the SI to port new capabilities, the team moves to a platform that already has them. The operating model work the team needs to do anyway becomes the focus. The infrastructure underneath it is no longer the team's ongoing cost. 

The customers who've made this move describe the outcome the same way: the engineering team stopped maintaining plumbing and started shipping data products. Guitar Center onboarded in days. Johnson & Johnson described it as a framework accelerator. Those outcomes aren't luck. They're the result of a platform layer that enforces the operating model by design. 

If the symptoms earlier in this article match what you're seeing, the next step is a conversation about where the gaps are and what the platform layer can take off your team's plate. Book a free architecture review. The review surfaces the operating model gaps driving the symptoms the business is already complaining about, and it's the fastest way to see whether the platform layer can shorten the path to the outcomes you expected when you started the build. 

A Data Operating Model is the work most enterprises skip because nobody told them it was the work. The tool purchase felt like progress. The SI engagement felt like progress. The first use cases shipping felt like progress. By the time the symptoms surfaced, the decisions that would have prevented them had been deferred long enough to become expensive.  

The executives who get this right aren't smarter than the ones who don't. They're just earlier. They define the operating model before the build starts, or they stop the build long enough to define it once they realize it was never decided. The teams that do that work once ship data products for years afterward. The teams that don't spend those same years compensating for decisions that were never made. 

If the symptoms in this article match what you're seeing in your own platform, the message is simple. The tools aren't failing you. The operating model underneath them is, and it will keep failing until somebody decides to define it. That work is smaller than it looks, it's faster to do than to defer, and it's the only path to the outcomes the business was expecting when the project started. 

Your team has spent eighteen months proving they can build. The next eighteen months are going to be about whether the business trusts what got built. That outcome is decided at the operating model layer, not at the tool layer. The sooner leadership treats it that way, the sooner the symptoms stop. 

Datacoves expands support snowflake ai data cloud
5 mins read
Datacoves enables Snowflake customers to deploy secure, end-to-end data engineering environments with dbt, Airflow, and modern DevOps best practices

Datacoves is expanding its integration with the Snowflake AI Data Cloud, giving Snowflake customers a secure, end-to-end data engineering environment with dbt, Airflow, and modern DevOps best practices, all running inside their own cloud.

This means Snowflake teams get a consistent foundation for development, orchestration, testing, CI/CD, and observability without moving data outside their environment or introducing new security risks.

“Snowflake is the analytical backbone for many of the world’s most data-driven organizations. Datacoves gives those teams a secure and opinionated platform to run modern data engineering practices on top of Snowflake, without forcing them into rigid SaaS tools or DIY infrastructure.”

— Noel Gomez, Co-Founder of Datacoves

What This Means for Snowflake Customers

Organizations using Snowflake can standardize how teams develop and operate analytics workflows using dbt, Airflow, Python, and Git-based workflows while maintaining full control over identity, access, logging, and infrastructure.

Datacoves is commonly used by large enterprises running Snowflake in regulated and complex environments, including life sciences, consumer goods, and financial services. These organizations require private deployment, operational flexibility, and strong engineering foundations.

Snowflake Cortex CLI: Built Into the Datacoves IDE

Datacoves already supports the Snowflake Cortex CLI (CoCo) inside the in-browser VS Code environment with zero setup required for end users. Snowflake credentials are automatically configured through the existing Snowflake extension, so developers can start using CoCo immediately.

The Cortex CLI works like Claude Code but runs on Snowflake’s infrastructure. Within Datacoves, developers can use it to:

•   Query and explore Snowflake data directly from the terminal

•   Generate Python scripts that interact with external APIs and services

•   Find tables, columns, and schema details across Snowflake databases

•   Accelerate development for both Snowflake-specific and general-purpose tasks

Because Datacoves provides a standardized, preconfigured development environment, there’s no installation guesswork. CoCo picks up existing credentials and connections automatically.

Snowcap: Open-Source Infrastructure as Code for Snowflake

Datacoves also maintains Snowcap, an open-source, Snowflake-native infrastructure-as-code tool built from deep experience managing Snowflake at scale.

Snowcap uses YAML or Python configuration, requires no state file, and supports over 60 Snowflake resource types. It includes opinionated accelerators for some of the most complex areas of Snowflake administration:

•   Role-Based Access Control (RBAC) for managing permissions at scale across teams, projects, and environments

•   Tag-Based Masking Policies for applying dynamic data masking consistently across sensitive columns

•   Row Access Policies for controlling row-level security with auditable, version-controlled configurations

These are areas where manual Snowflake administration breaks down fast, especially as teams and data grow. Snowcap brings software engineering discipline to Snowflake governance with CI/CD integration through GitHub Actions.

A Platform Built Around Snowflake

Snowflake handles storage and compute. Datacoves provides the engineering layer that sits on top: managed dbt, managed Airflow, CI/CD, governance, and best practices. Together, they give enterprise teams a complete, production-ready data engineering environment deployed in weeks.

Teams eliminate fragmented environments, inconsistent workflows, and manual platform maintenance. The result is faster onboarding, clearer ownership, and improved visibility across ingestion, transformation, orchestration, and deployment.

To learn more about how Datacoves supports Snowflake teams, book a free architecture review or visit datacoves.com/snowflake.

About Datacoves

Datacoves is an end-to-end data engineering platform that helps organizations deliver secure, high-quality data products with speed and confidence. Deployed inside a customer’s own cloud and enterprise network, Datacoves provides a unified environment for development, orchestration, testing, CI/CD, and observability. It delivers a managed platform for dbt, Airflow, and Python without vendor lock-in.

Thumbnail beyond dbt tests
5 mins read

dbt’s built-in tests cover the fundamentals: uniqueness, nulls, referential integrity, accepted values. But as your project grows, so do the gaps. Anomalies that no one wrote a test for. Code changes that silently break downstream models. Production pipelines that look healthy until a stakeholder finds stale data in a dashboard.

The tools in this guide pick up where basic dbt tests stop. They fall into three categories: pre-production validation (comparing data between environments before code merges), production observability (continuous monitoring of pipeline health over time), and full-stack observability (commercial platforms covering your entire data platform beyond dbt). Some tools span more than one category. The right combination depends on your team's maturity, stack complexity, and where you're losing the most time to data issues today.

This is the companion to An Overview of Testing Options for dbt, which covers everything that ships with dbt Core and the most common testing packages. If you haven’t built out your test suite yet, start there. This guide assumes you already have basic dbt data tests in place.

Where Basic dbt Tests Stop and These Tools Begin

If you’ve followed the dbt testing guide, your project already has generic tests, singular tests, and probably packages like dbt-utils or dbt-expectations for richer assertions. That coverage handles a lot. But it has a ceiling.

Rule-based tests catch what you anticipated. They won’t tell you that row volumes dropped 40% overnight, that a source table stopped arriving on schedule, that a gradual shift in null rates is slowly corrupting a downstream report, or that your “harmless” model refactor just changed 15,000 values in a column no one thought to test.

The tools in this guide fill those gaps. They fall into three categories:

Pre-production validation compares data between your development and production environments before code merges. If a model refactor changes row counts, adds or removes rows, shifts column values, or alters schema structure, these tools surface the specific differences in your PR so reviewers can see the data impact alongside the code change. Tools: dbt-audit-helper, Recce, Datafold.

Production observability monitors your pipeline health continuously after deployment. Instead of testing specific conditions, it builds statistical baselines over time and alerts you when behavior deviates: freshness failures, volume anomalies, schema changes, distribution drift. Tools: Elementary, Soda.

Full-stack observability extends monitoring beyond dbt to cover your entire data platform, including ingestion tools, warehouses, BI layers, and AI workloads. These are commercial platforms for teams where dbt is one piece of a larger stack. Tools: Monte Carlo, Bigeye, Metaplane.

Advanced dbt data quality tools three categories

A complete observability layer tracks four dimensions:

Freshness monitors whether models and sources are updating on schedule. A freshness failure often means an upstream pipeline broke before any dbt test had a chance to run.

Volume tracks whether row counts and event rates behave as expected. Sudden drops or spikes frequently signal upstream issues before any explicit test fires.

Schema detects column additions, removals, renames, and data type changes that can silently break downstream models and dashboards.

Distribution watches the statistical properties of your data over time: null rates, cardinality, value ranges. Gradual drift here can corrupt reports without triggering a single test failure.

Dimension | What it tracks | Example signal
Freshness | Whether models and sources update on schedule | Source table hasn’t refreshed in 6 hours
Volume | Row counts, record volumes, event rates | Orders table dropped 40% overnight
Schema | Column additions, removals, type changes | A column was renamed upstream without notice
Distribution | Null rates, cardinality, value ranges | Null rate in customer_id climbed from 0.1% to 12%
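To ground two of these dimensions, here is a small sketch of a freshness check and a volume check using the example signals from the table. The function names and thresholds are hypothetical; observability tools typically learn thresholds from history rather than hard-coding them.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_loaded_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Freshness: True when the last load is older than the allowed age."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > max_age


def volume_change(previous_rows: int, current_rows: int) -> float:
    """Volume: fractional change in row count (-0.40 means a 40% drop)."""
    if previous_rows == 0:
        raise ValueError("previous_rows must be non-zero")
    return (current_rows - previous_rows) / previous_rows


# Example signals: a source last loaded 7 hours ago against a 6-hour limit,
# and an orders table that went from 1,000 rows to 600.
checked_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_stale(checked_at - timedelta(hours=7), timedelta(hours=6), now=checked_at))
print(f"{volume_change(1000, 600):.0%}")
```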

Some tools span categories. Elementary installs as a dbt package (making it feel like an extension of your test suite) but its core value is production observability. Datafold started as a data diffing tool but now includes production monitors. The categories describe what problem you’re solving, not rigid product boundaries.

Pre-Production Validation: Catching Problems Before Merge

dbt pull request workflow with data validation

Your dbt tests pass. Your CI pipeline is green. You merge the PR. And then a stakeholder reports that revenue numbers shifted by 12% in a dashboard no one connected to the model you changed.

This happens because dbt tests validate conditions you defined, not the data impact of your code change. A model can pass every test and still produce different data than it did yesterday. Pre-production validation tools close that gap by comparing data between environments before code reaches production.

dbt-audit-helper: Data Diffing Inside Your dbt Project

dbt-audit-helper, maintained by dbt Labs, is a dbt package that compares two relations or queries row by row and column by column. It’s the simplest way to validate that a model refactor, migration, or logic change didn’t introduce unintended differences.

The package provides 10 active macros organized into four groups:

  • Row-level comparison (compare_and_classify_query_results, compare_and_classify_relation_rows) classifies every row as identical, modified, added, or removed, with summary stats and sample records. These are the primary macros for most use cases.
  • Column-level investigation (compare_column_values, compare_all_columns, compare_which_query_columns_differ, compare_which_relation_columns_differ) drills into which specific columns have differences and breaks down match status per column: perfect match, both null, values don't match, null in one side only, missing from one side. Use these after a row comparison reveals mismatches.
  • Schema comparison (compare_relation_columns) compares column names, data types, and ordinal positions between two relations. Useful for catching structural changes during migrations or refactors.
  • Quick identity check (quick_are_queries_identical, quick_are_relations_identical, compare_row_counts) provides fast yes/no answers. The quick macros use hashing for speed (currently Snowflake and BigQuery only). compare_row_counts does a simple count comparison between two relations.

A typical workflow: you refactor a model, run it against your dev environment, then use compare_and_classify_relation_rows to compare the dev output against the production version. If rows show as modified, you drill in with compare_which_relation_columns_differ to find which columns changed, then compare_column_values to understand the specific discrepancies.
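The row classification idea is easier to see in miniature. The sketch below is not how dbt-audit-helper is implemented (the macros generate SQL that runs in your warehouse); it is a plain-Python illustration of the same four buckets, with hypothetical data keyed by primary key.

```python
from typing import Dict, Hashable, Mapping, Tuple

Row = Tuple  # column values for one row, excluding the primary key


def classify_rows(
    prod: Mapping[Hashable, Row], dev: Mapping[Hashable, Row]
) -> Dict[str, int]:
    """Count rows as identical, modified, added, or removed, matched by key."""
    counts = {"identical": 0, "modified": 0, "added": 0, "removed": 0}
    for key in set(prod) | set(dev):
        if key not in prod:
            counts["added"] += 1      # exists only in the dev build
        elif key not in dev:
            counts["removed"] += 1    # exists only in production
        elif prod[key] == dev[key]:
            counts["identical"] += 1
        else:
            counts["modified"] += 1   # same key, different column values
    return counts


# Hypothetical outputs of the same model in production and in dev.
prod_rows = {1: ("shipped", 10.0), 2: ("pending", 20.0), 3: ("shipped", 30.0)}
dev_rows = {1: ("shipped", 10.0), 2: ("pending", 25.0), 4: ("new", 40.0)}

print(classify_rows(prod_rows, dev_rows))
```

Any non-zero count outside "identical" is the cue to drill into specific columns.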

dbt-audit-helper is free, open source, and runs entirely inside your dbt project. The tradeoff is that everything is manual. You write SQL files using the macros, run them one model at a time, and read the output in your terminal or warehouse. There's no UI, no PR integration, no automated detection of which models changed. For ad hoc validation during refactoring or migration, it's excellent. For ongoing change management across a team, you'll want Recce or Datafold.

Recce: Data-Level PR Review for dbt Teams 

Recce is an open-source data validation toolkit built specifically for dbt PR workflows. Where dbt-audit-helper requires you to write macros and run them manually, Recce automates the comparison and packages the results into a format designed for PR review.

When a developer opens a PR, Recce compares a production baseline against the development branch using a suite of checks:

  • Lineage diff shows which models in the DAG were added, removed, or modified, and flags downstream models as impacted.
  • Row count diff shows whether a model gained or lost rows after the change, and by how much.
  • Schema diff catches column additions, removals, and data type changes in the model output.
  • Value diff samples actual row values between baseline and candidate, useful for catching unintended logic changes.
  • Profile diff compares the statistical shape of a model: null rates, unique value counts, min/max ranges.
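As a rough picture of what a profile diff reports, the sketch below computes null rate, distinct count, and min/max for one column in a baseline and a candidate, then keeps only the statistics that changed. It is an illustration of the concept, not Recce's implementation, and the function names and sample values are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


def profile(values: List[Optional[float]]) -> Dict[str, Any]:
    """Summarize one column: null rate, distinct count, min, and max."""
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def profile_diff(
    baseline: List[Optional[float]], candidate: List[Optional[float]]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {statistic: (baseline_value, candidate_value)} for changed stats."""
    base, cand = profile(baseline), profile(candidate)
    return {stat: (base[stat], cand[stat]) for stat in base if base[stat] != cand[stat]}


# Same column before and after a hypothetical refactor.
print(profile_diff([1, 2, 3, None], [1, 2, 2, None]))
```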

As you run checks, Recce lets you add each result to a validation checklist with notes explaining your findings. When you’re ready for review, you export the checklist to your PR comment. The reviewer gets a curated summary of the data impact rather than raw output they have to interpret themselves.

Recce OSS includes all the diff tools, the checklist workflow, and a CLI for CI/CD integration. Recce Cloud (commercial version) adds an AI Data Review Agent that auto-summarizes data impact on every PR, real-time collaboration, automatic checklist sync, and PR gating. For a detailed walkthrough of the workflow, see Recce's data validation toolkit guide.

Datafold: Automated Data Diffing in CI/CD

Datafold is a commercial data engineering platform that automates data diffing as part of your CI/CD pipeline. Both Recce and Datafold run automatically on PRs, but they follow different philosophies: Recce lets developers scope and choose which diffs matter, while Datafold diffs every changed model on every PR by default. Datafold's approach gives full coverage with less manual decision-making; Recce's reduces noise by keeping humans in the loop.

Datafold integrates deeply with both dbt Core and dbt Cloud. Its core capabilities:

  • Data diffing in CI/CD automatically diffs changed models and their downstream dependencies on every PR, posting results as a comment
  • Column-level lineage traces impact from dbt models through to BI tools like Looker and Tableau
  • Production monitors track data diff, schema change, and metric anomalies via YAML configuration
  • AI code review enforces SQL standards automatically on pull requests
  • MCP server lets AI coding agents validate their own work against production data
  • Cross-database diffing compares data across different warehouses for migrations

Datafold supports Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, and DuckDB, with cross-database diffing for migrations. VPC deployment is available for teams with strict security requirements.

The open-source data-diff CLI that Datafold previously maintained was deprecated in May 2024. All diffing capabilities now require Datafold Cloud.

How dbt-audit-helper, Recce, and Datafold Compare

Feature | dbt-audit-helper | Recce | Datafold
What it is | dbt package (macros) | Open-source toolkit + optional Cloud | Commercial platform
How you use it | Write SQL, run manually, read output | Runs in CI/CD, developer picks which diffs to run, exports checklist to PR | Runs in CI/CD, auto-diffs every changed model, posts full comment
Philosophy | Ad hoc, one model at a time | Human-in-the-loop, targeted validation | Diff everything, full automation
dbt integration | Native (runs inside your project) | External (compares two environments) | Deep (Core + Cloud, auto-detects changes)
Column-level lineage | No | Yes (dbt DAG) | Yes (extends to BI tools like Looker, Tableau)
UI | None (terminal/warehouse output) | Web UI + PR comments | Web UI + PR comments
CI/CD integration | You build it yourself | CLI for CI, Cloud for PR gating | Built-in, runs automatically
Production monitoring | No | No | Yes (YAML-configurable monitors)
Cost | Free (open source) | Free (OSS), paid Cloud tier | Commercial
Best for | Ad hoc refactoring and migration validation | Teams wanting PR-level data review with control | Teams wanting fully automated data testing in CI/CD

Production Observability: Monitoring Pipeline Health Over Time 

Pre-production validation catches problems before merge. But not every data issue originates from a code change. Sources stop updating. Upstream systems introduce silent schema changes. Row volumes drift gradually over weeks until a report breaks. These are production problems, and they require tools that monitor your pipeline continuously, not just when someone opens a PR.

Elementary: Open-Source Observability for dbt

Elementary is an open-source observability tool built natively on dbt. It installs as a dbt package, runs as part of your project, and stores all observability data directly in your warehouse. No separate infrastructure, no additional warehouse connection. Elementary supports Snowflake, BigQuery, Redshift, Databricks, and PostgreSQL.

Elementary does three things:

Collects and stores test result history. Every dbt test run, including pass/fail status, failure counts, execution time, and the rows that failed, gets written to queryable tables in your warehouse. This gives you trend visibility that dbt’s native artifacts don’t provide.

Adds anomaly detection monitors. Elementary provides dbt-native monitors you configure in YAML, covering row count anomalies, freshness, event freshness (for streaming data), null rate changes, cardinality shifts, and dimension distribution. These use Z-score-based statistical detection: Elementary builds a baseline from your historical data (default 14-day training period) and flags values that fall outside the expected range. You can tune sensitivity, time buckets, and training windows per test.

Elementary OSS also includes an AI-powered test (ai_data_validation), currently in beta, that lets you define expectations in plain English. For example, expectation_prompt: "There should be no contract date in the future". Instead of running its own LLM, Elementary uses the AI functions built into your warehouse (Snowflake Cortex, Databricks AI Functions, or BigQuery Vertex AI), so your data never leaves your environment. Setup requires enabling the relevant LLM service in your warehouse first.

An Elementary monitor configuration looks like this:

Elementary config for table level tests

Generates a self-hosted observability report. The Elementary CLI produces a rich HTML report you can host on S3, an internal server, or any static file host. It shows model lineage, test results over time, and anomaly alerts in one place. Alerts can be sent to Slack or Microsoft Teams. Full configuration options are in the Elementary docs.

Elementary also includes schema validation tests (detecting deleted or added columns, data type changes, deviations from a configured baseline, JSON schema violations) and exposure validation (detecting column changes that break downstream BI dashboards).
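To make the Z-score detection described above concrete, here is a minimal Python sketch of the idea. It is not Elementary's code (Elementary runs its monitors as dbt macros in your warehouse). The function name, the sample row counts, and the sensitivity threshold of 3 standard deviations are all assumptions chosen for illustration.

```python
import math
from statistics import mean, stdev


def zscore_anomaly(history, latest, sensitivity=3.0):
    """Compare `latest` with a baseline built from `history`.

    Returns (is_anomaly, z), where z is how many sample standard
    deviations `latest` sits from the historical mean.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points for a baseline")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat baseline: any change at all falls outside the expected range.
        z = 0.0 if latest == mu else math.copysign(math.inf, latest - mu)
        return latest != mu, z
    z = (latest - mu) / sigma
    return abs(z) > sensitivity, z


# Hypothetical 14-day training window of daily row counts.
daily_row_counts = [1000, 1010, 990, 1005, 995, 1002, 998,
                    1008, 992, 1001, 999, 1004, 996, 1000]

# A sudden drop to 400 rows sits far outside the baseline.
is_anomaly, z = zscore_anomaly(daily_row_counts, latest=400)
print(is_anomaly, round(z, 1))
```

Raising `sensitivity` makes the check more tolerant of noise, and a longer `history` makes the baseline more stable. Those are the same trade-offs you make when tuning sensitivity and training windows per test.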

What Elementary monitors in your dbt project

OSS vs. Cloud: The features above are all available in Elementary OSS. Elementary Cloud adds automated monitors that require no YAML configuration, column-level lineage extending to BI tools, a built-in data catalog, incident management, AI agents for triage and test recommendations, and a collaborative UI for non-technical users.

Elementary is the right starting point for most dbt teams because it fits inside a workflow you already have. Adding it requires a package installation and a few lines of YAML. If your needs grow beyond what OSS provides, the Cloud tier is the upgrade path.

Soda: Human-Readable Data Quality for Cross-Functional Teams

Soda v4 architecture

Soda is an open-core data quality platform designed so that analysts and business stakeholders can write and own quality checks alongside the engineering team. Where Elementary is built for engineers working inside dbt, Soda is built for shared ownership of data quality across roles.

With the release of Soda v4, the platform has two pillars: Data Testing (proactive, contract-based validation) and Data Observability (reactive, ML-powered monitoring in production). This marks a shift from the earlier CLI-centric approach toward a unified data quality platform.

Soda v4 introduces a Contract Language, a YAML-based format for defining data quality expectations as enforceable agreements between data producers and consumers. A data contract looks like this:

Soda test config

Contracts are verified using Soda Core v4, the open-source Python engine that now functions as a Data Contract Engine. It runs contract verifications locally or in pipelines and supports 50+ built-in data quality checks. Soda Core v4 does not include observability features; those require Soda Cloud or a Soda Agent.
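The idea of a contract as a set of enforceable checks can be sketched in plain Python. This is only an illustration of the concept: it does not use Soda's Contract Language or the Soda Core API, and the `Check` class, `verify_contract` function, and `orders_contract` checks are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class Check:
    """One expectation the data producer agrees to uphold."""
    name: str
    passes: Callable[[list[Row]], bool]


def verify_contract(rows: list[Row], checks: list[Check]) -> list[str]:
    """Return the names of failed checks; an empty list means the contract holds."""
    return [check.name for check in checks if not check.passes(rows)]


# Hypothetical contract for an orders dataset.
orders_contract = [
    Check("row_count > 0", lambda rows: len(rows) > 0),
    Check("order_id has no missing values",
          lambda rows: all(r.get("order_id") is not None for r in rows)),
    Check("order_id is unique",
          lambda rows: len({r.get("order_id") for r in rows}) == len(rows)),
]

good = [{"order_id": 1}, {"order_id": 2}]
bad = [{"order_id": 1}, {"order_id": 1}, {"order_id": None}]

failed = verify_contract(bad, orders_contract)
print(failed)
```

In a pipeline, a non-empty result would stop the run before the dataset reaches consumers. That is what makes the agreement enforceable rather than advisory.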

Teams still using SodaCL (the v3 check language) can continue doing so, but new development is centered on the Contract Language. SodaCL documentation is maintained under the Soda v3 docs.

Soda's deployment model has three tiers. Soda Core (open source) runs contract verifications in your pipelines. Soda-hosted Agent is a managed runner that adds observability, scheduling, and the ability to create checks from the Soda Cloud UI. Self-hosted Agent provides the same capabilities deployed in your own Kubernetes environment. Observability features (anomaly detection, metric trending, automated monitoring) require either Agent option plus Soda Cloud.

Soda Cloud is the commercial SaaS layer that adds dashboards, alerting (Slack, MS Teams, Jira, PagerDuty, ServiceNow), collaborative data contracts with role-based ownership, and a UI for both technical and non-technical users.

Soda isn't dbt-native. It works independently and can ingest dbt test results into Soda Cloud for visualization rather than replacing your dbt tests. It integrates with Airflow, Dagster, Prefect, and Azure Data Factory for orchestration, and with Atlan, Alation, and Collibra for data cataloging. It supports Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, DuckDB, and more.

If data quality ownership needs to extend beyond your engineering team, or you need a warehouse-agnostic quality layer that works both inside and outside dbt, Soda is built for that. The producer/consumer contract model is its most meaningful distinction from Elementary.

Full-Stack Observability Beyond dbt 

Elementary and Soda work well when dbt is the center of your data stack. But many organizations run pipelines that span ingestion tools, multiple transformation layers, legacy ETL platforms, and BI tools that dbt never touches. When a data quality issue could originate anywhere in that chain, you need observability that covers the full stack, not just the dbt layer.

Monte Carlo: Enterprise Data + AI Observability

Monte Carlo is a commercial observability platform that connects directly to your warehouse and automatically learns the baseline behavior of your tables using ML. No manual threshold configuration, no YAML. It supports Snowflake, BigQuery, Redshift, and Databricks across all three clouds, plus data lakes via Hive and Glue metastores.

Where Elementary requires you to define each monitor, Monte Carlo deploys monitoring out of the box. It provides automated field-level lineage across your entire stack (not just dbt), integrates with Airflow, Fivetran, Azure Data Factory, Informatica, Databricks Workflows, Prefect, Looker, Tableau, and dbt, and includes centralized incident management.

In 2025, Monte Carlo launched Observability Agents: a Monitoring Agent that recommends and deploys monitors automatically based on data profiling, and a Troubleshooting Agent that investigates root causes by testing hundreds of hypotheses across related tables in parallel. Monte Carlo now also extends monitoring to AI agent inputs and outputs alongside traditional pipeline health.

Monte Carlo’s value compounds as your stack grows beyond dbt. For teams running primarily dbt workloads, the overhead and cost typically outweigh the benefits compared to Elementary. But for large, multi-tool platforms with SLA requirements and dedicated data reliability teams, Monte Carlo is purpose-built.

Bigeye: Enterprise Observability Across Modern and Legacy Stacks

Bigeye is a commercial observability platform that differentiates on lineage depth. After acquiring Data Advantage Group, Bigeye offers end-to-end column-level lineage across both modern cloud warehouses and legacy ETL platforms including Informatica, Talend, SSIS, and IBM DataStage. That makes it a strong fit for enterprises running hybrid stacks where not everything lives in Snowflake or Databricks.

Bigeye provides 70+ data quality monitoring metrics with ML-powered anomaly detection, and supports join-based rules that validate data across tables in different databases. Recent additions include customizable data quality dimensions, PII/PHI detection for sensitive data classification, and an AI Trust platform that applies runtime enforcement to AI data policies.

If your observability needs span legacy ETL systems alongside modern cloud warehouses, or you need cross-database data quality rules and sensitive data detection, Bigeye covers territory that Monte Carlo and Elementary don’t.

Metaplane: Self-Service Observability for Modern Cloud Stacks

Metaplane takes a different approach: self-service observability with minimal setup. Connect your warehouse, BI tool, and dbt repo, and Metaplane’s ML engine starts learning from your metadata and generating alerts within days. No manual thresholds, no engineering effort to configure. It was acquired by Datadog in 2025, positioning it as the bridge between application observability and data observability.

Metaplane provides anomaly detection, column-level lineage, schema change detection, and CI/CD support for dbt (impact previews and regression tests in PRs). It also offers a Snowflake native app that lets you pay with existing Snowflake credits.

Metaplane is optimized for modern cloud stacks. Its integrations cover the core of a typical modern data platform: Snowflake, BigQuery, Redshift, Databricks, ClickHouse, and S3 for warehouses and data lakes; PostgreSQL, MySQL, and SQL Server for transactional databases; Fivetran and Airbyte for ingestion; dbt Core and dbt Cloud for transformation; Airflow for orchestration; Census and Hightouch for reverse ETL; Looker, Tableau, Power BI, Metabase, Mode, Sigma, and Hex for BI; Slack and Jira for notifications.

The tradeoff is scope. Metaplane doesn't cover legacy ETL systems like Informatica, Talend, or SSIS, and its orchestration support is limited to Airflow. For teams with complex hybrid stacks, Bigeye or Monte Carlo may fit better. For modern cloud-native stacks where fast setup matters more than exhaustive coverage, Metaplane is hard to beat. Pricing starts with a free tier, with team plans scaling based on usage.

Great Expectations: Python-First Data Validation

Great Expectations (GX) is one of the most widely used open-source Python frameworks for data validation. It’s not dbt-native and it’s not an observability platform. It’s a standalone validation engine for teams that need to define, execute, and document data quality checks across any Python-accessible data source.

GX Core (open source, Apache 2.0) lets you define “Expectations” (data assertions) and run them against Pandas DataFrames, Spark, or any database supported by SQLAlchemy. Results are rendered as auto-generated “Data Docs,” human-readable HTML documentation of what passed and failed. GX integrates with Airflow, Databricks, Snowflake, BigQuery, Redshift, PostgreSQL, and Microsoft Fabric.

GX Cloud (commercial) adds a web UI for managing expectations without code, scheduled validations, alerting, Data Health dashboards, and ExpectAI, which generates expectations from natural language prompts. Currently ExpectAI supports Snowflake, PostgreSQL, Databricks SQL, and Redshift.

The tradeoff is complexity. GX has a steeper learning curve than Elementary or Soda. Its architecture (DataContext, DataSources, ExpectationSuites, Checkpoints, Stores) requires more setup and conceptual overhead than adding a dbt package or writing SodaCL checks. For teams with strong Python skills who want deep, standalone validation across multiple data sources independent of dbt, it remains a solid choice. For dbt-centric teams, Elementary or Soda will get you to value faster.

How to Choose the Right Combination

The right tooling depends on where your team sits on the data quality maturity curve. A five-person analytics engineering team running 50 dbt models doesn’t need Monte Carlo. A platform team managing hundreds of models across multiple ingestion tools, transformation layers, and BI dashboards probably can’t get by with just Elementary.

Data Quality Tool Progression for dbt Teams


Feature | dbt-audit-helper | Recce | Datafold | Elementary | Soda | Monte Carlo | Bigeye | Metaplane | Great Expectations
Category | Pre-production | Pre-production | Pre-prod + monitoring | Production obs | Production obs | Full-stack obs | Full-stack obs | Full-stack obs | Standalone validation
Runs when | Manually (Dev) | During PR review | Automatically on PR | Schedule / CI | Schedule / Demand | Continuously | Continuously | Continuously | On demand
dbt integration | Native (pkg) | External | Deep (Core + Cloud) | Native (pkg) | Ingests results | Warehouse conn | Warehouse conn | Warehouse + dbt | None (Python)
Infrastructure | None | Dev + Prod env | Datafold Cloud | Your Warehouse | Agent or Cloud | Managed | Managed | Managed | Python env
Open source | Yes | Yes (OSS+) | No | Yes (OSS+) | Open Core | No | No | No | Yes (OSS+)
Best for | Refactoring & Migrations | PR data review | Automated CI/CD diffs | dbt-first monitoring | Cross-team quality | Large platforms | Hybrid stacks | Modern stack setup | Python/Non-dbt teams

For most dbt teams, the progression looks like this:

Already have dbt tests and packages? Add dbt-audit-helper for ad hoc data comparison when you refactor models or migrate from legacy SQL. It costs nothing and runs inside your project.

Merging dbt changes regularly and want a safety net? Add Recce if you want an open-source, developer-controlled workflow. Choose Datafold if you want fully automated diffing on every PR with lineage into BI tools.

Need to know when production data goes wrong between deploys? Deploy Elementary. It covers anomaly detection, test result history, and alerting with no infrastructure outside your warehouse.

Data quality ownership extends beyond engineering? Evaluate Soda for its human-readable checks and data contracts.

Stack extends well beyond dbt? Evaluate Monte Carlo for ML-based full-stack coverage, Bigeye for hybrid modern/legacy environments, or Metaplane for fast self-service setup on modern stacks.

These tools aren’t mutually exclusive. The strongest data teams typically run two or three: one for pre-production validation, one for production observability, and sometimes a commercial platform on top for cross-stack coverage. The maturity curve gives you the order. Don’t try to run before you’ve learned to walk.

How Datacoves Supports Your Data Quality Stack

Datacoves doesn't bundle or pre-configure any of the tools in this guide. What it does is provide a managed dbt and Airflow environment that's compatible with all of them. If your team already uses Elementary, Soda, Recce, or any other package, Datacoves supports that workflow without getting in the way.

For example, if a client already runs Elementary, it keeps running inside the Datacoves environment. The same applies to Recce in CI/CD, dbt-audit-helper in development, or any other dbt package or external integration. Datacoves doesn't own or maintain these tools, but it ensures they work within a governed, orchestrated platform where your team can connect observability data to Airflow DAG runs, version control history, and deployment pipelines.

The value isn't in pre-installing packages. It's in providing the environment where these tools run reliably alongside everything else your data team needs.

What to Take Away

If your dbt project has basic tests in place and you’re still getting surprised by data issues, you don’t need more tests. You need coverage at different points in the lifecycle.

Before merge: start with dbt-audit-helper for ad hoc comparison, then graduate to Recce or Datafold when your team needs automated PR-level validation.

After deployment: Elementary gives you production anomaly detection, test result history, and alerting inside your existing dbt workflow. It’s the lowest-friction path to observability for most teams.

Beyond dbt: if your stack spans ingestion tools, legacy ETL, and BI layers that dbt doesn’t touch, Monte Carlo, Bigeye, and Metaplane provide the cross-stack coverage. Soda and Great Expectations fit teams that need quality ownership or validation logic outside the dbt ecosystem.

The teams that build the most reliable data platforms aren’t the ones running the most tools. They’re the ones that picked the right tools for the right problems at the right stage of their maturity curve.

This guide is the companion to An Overview of Testing Options for dbt. If you haven’t built your test suite yet, start there. The tools in this article are most valuable when they sit on top of a solid testing foundation.

Snowflake Won't Build Your Data Platform For You
5 mins read

Snowflake is one of the best data warehouses available. But buying it doesn't give you a data platform. A working platform also requires an engineering environment where your team can develop consistently, orchestration to run and monitor pipelines, CI/CD to enforce quality before anything reaches production, and ways of working that make the whole thing maintainable as your team grows. Most Snowflake implementations deliver the warehouse. The platform layer around it, and the practices underneath it, are usually left for your team to figure out after the system integrator (SI) rolls off. That gap is where most implementations quietly fail. 

Buying Snowflake gives you a warehouse. A working data platform requires an engineering environment, orchestration, CI/CD, and ways of working that don't come with the warehouse contract. 

What a Data Platform Actually Requires 

A data warehouse stores and processes data. That's what it was designed to do, and Snowflake does it exceptionally well. 

A data platform does something different. It's the environment where your team develops, tests, deploys, and monitors data products. It includes the tools, the conventions, and the ways of working that determine whether your data is trustworthy, usable, and maintainable at scale. 

The distinction matters because most implementations are scoped around the warehouse. The platform layer gets treated as something that will sort itself out later. It rarely does. 

Think about it in two layers. 

The first is what your users experience: whether they trust the data, whether they can find and understand it, and whether business and technical teams can communicate around it. These are trustworthiness, usability, and collaboration. 

The second is what makes those outcomes possible at the platform level: whether data products can be reused without rebuilding from scratch, whether the system is maintainable when people leave or the team grows, and whether pipelines are reliable enough that failures get caught early instead of surfacing in a meeting. These are reusability, maintainability, and reliability. 

Most Snowflake implementations deliver storage and compute. The six outcomes above are what your business expected the platform to produce. They require deliberate work that sits outside the warehouse contract.

Two Layer Framework

Why Snowflake Alone Doesn't Get You There 

Snowflake is excellent at what it does. Fast queries, elastic scaling, clean separation of storage and compute, a strong security model. If your previous warehouse was on-prem or running on aging infrastructure, the difference is real and immediate. 

The problem isn't Snowflake. It's the expectation that the warehouse is the platform. 

Snowflake handles storage, compute, and access control. It doesn't give your team a development environment. It doesn't orchestrate your pipelines or tell you when one failed and why. It doesn't enforce naming conventions, testing standards, or deployment rules. It doesn't document your data models or make them understandable to a business analyst who didn't build them. It doesn't define how your team reviews code, manages branches, or promotes changes from development to production. 

Those things aren't gaps in Snowflake's product. They were never Snowflake's job. 

But when leaders evaluate a warehouse and sign a contract, the scope of what they're buying rarely gets articulated clearly. The demos show fast queries and a clean UI. The pitch covers performance benchmarks and cost savings versus the legacy system. Nobody walks through the engineering environment your team will need to build on top of it, because that's not what the vendor is selling. 

So teams buy a best-in-class warehouse and then spend the next six months discovering everything else they need. Some figure it out. Some don't. And most take a long time to get there. 

How Leaders End Up With a Warehouse and Not Much Else 

There are three common paths to a Snowflake implementation. Each one has real strengths. Each one has a predictable blind spot that leads to the same outcome: a warehouse that works, but fails to deliver the expected results. 

Three Paths

The Vendor Marketing Problem 

Snowflake's marketing is good. That's not a criticism, it's an observation. The positioning is clear, the case studies are compelling, and the product genuinely delivers on the core promise. 

What the marketing doesn't cover is everything that sits around the warehouse. That's not Snowflake's job. Their job is to sell Snowflake. The implicit message, though, is that the hard problem is the warehouse. Once that's solved, everything else follows. 

It doesn't. Leaders who build their implementation strategy around the vendor pitch tend to underscope the project from the start. The warehouse gets stood up on time and on budget. The data engineering environment, the orchestration layer, the governance foundation, those get deferred. Sometimes indefinitely. 

The Internal Enthusiasm Problem 

Every organization has at least one person who comes back from a Snowflake conference ready to modernize everything. That enthusiasm is valuable. It's also frequently mis-channeled. 

Internal champions know the business problem well. They've seen the pain. What they often don't have is deep experience building and operating a production data platform from scratch. They know what good outcomes look like. They haven't necessarily seen what a well-built foundation looks like underneath those outcomes. 

So the implementation gets shaped around what they know: the warehouse, the transformation tool, maybe a basic orchestration setup. The harder questions around developer environments, CI/CD, testing standards, secrets management, and deployment conventions don't get asked because nobody in the room has been burned by skipping them before. 

The SI Migration Problem 

A migration is not a platform implementation. The SI's job is to get your data into Snowflake. Whether the environment your team inherits is maintainable and built on sound engineering practices is usually outside the engagement scope. 

System integrators are good at migrations. Moving data from point A to point B, replicating existing logic in a new tool, hitting a go-live date. That's what most of them are scoped and incentivized to deliver. 

It's not that SIs cut corners. It's that "build a production-grade data engineering platform with sustainable ways of working" wasn't in the statement of work. 

What gets handed off is a warehouse with some tables, some transformation logic, and documentation that will be out of date within a month. The team that inherits it then spends the next year figuring out how to operate it at scale. 

If you're evaluating implementation partners, here's what to look for before you sign. 

What Gets Skipped When You Rush the Foundation 

When the implementation is scoped around the warehouse and the migration, a predictable set of things gets deferred. Not because anyone decided they didn't matter, but because they weren't on the project plan. 

Here's what that looks like in practice six to twelve months later. 

Snowflake costs start climbing. Without well-structured data models, query optimization standards, and sensible clustering strategies, warehouses burn credits fast. Teams that skipped the engineering foundation often spend the first year optimizing for cost rather than delivering new capabilities. The savings from migrating off the legacy system quietly get absorbed by an inefficient Snowflake setup. 

Business users don't trust the data. When there are no testing standards, no documentation conventions, and no consistent naming across models, analysts spend more time validating numbers than using them. The platform gets a reputation for being unreliable. People go back to Excel because nobody built the layer that makes data understandable and trustworthy. 

The team can't move fast. Without CI/CD pipelines, code reviews, and deployment guardrails, every change is a risk. Engineers slow down because they're afraid of breaking something. Onboarding a new team member takes weeks because the knowledge lives in people's heads, not in the system. 

Pipelines break in ways nobody sees coming. Without orchestration that handles dependencies, retries, and failure alerts, pipeline failures surface downstream. A business user notices the numbers are wrong before the data team does. That erodes trust fast and is hard to rebuild. 

The foundation debt compounds. Every week that passes without fixing the underlying structure makes it harder to fix. New models get built on top of a shaky base. Refactoring becomes expensive. The team that was supposed to be delivering new data products spends its time maintaining what already exists. 

This is the real cost of the quick win approach. Six months of fast progress followed by years of slow, careful, expensive work to undo the shortcuts. 

We've documented what that looks like in practice here.

Tools and Ways of Working Have to Go Together 

Most implementation conversations focus on the tool stack. Which warehouse, which transformation framework, which orchestrator. Those are real decisions and they matter. 

But the teams that deliver reliable data products consistently aren't just using the right tools. They're using them the same way across every engineer on the team. 

That's the ways of working problem. And it's the part nobody puts in the project plan. 

A team with Snowflake and dbt but no agreed branching strategy, no code review process, no testing standards, and no deployment conventions is still fragile. One engineer builds models one way. Another builds them differently. A third inherits both and must figure out which approach is "correct" before they can extend anything. The system never enforced a consistent approach. 

The same applies to orchestration. Airflow is powerful. An Airflow environment where every engineer writes DAGs differently, secrets are managed inconsistently, and there's no standard for how pipeline failures get handled is not an asset. It's a maintenance problem waiting to get worse. 

Good data engineering is a thought-out combination of tools and conventions that work together. The conventions are what make the tools scale beyond the person who set them up. 

This is why the two-layer framework matters in practice. Trustworthiness, usability, and collaboration aren't outcomes you get from buying the right tools. They're outcomes you get when the platform layer underneath (reusability, maintainability, and reliability) is built deliberately, with both the right tooling and the right ways of working enforced by the system itself rather than by people remembering to follow a document. 

The teams that figure this out usually do it the hard way. They run into the problems first, then back into the conventions that would have prevented them. That process can take years and a lot of frustration. Getting the ways of working right from the start compresses that timeline significantly. 

Doing It Right Upfront Is the Fast Path 

Fast Path Debt Path

The most common objection to investing in the foundation is time. Leaders have stakeholders who want results. Boards want dashboards. The business wants answers. Spending eight weeks building an engineering environment and establishing conventions feels like the opposite of moving fast. 

That instinct is understandable. It's also wrong. 

The teams that move fastest twelve months in are almost always the ones who slowed down at the start. Not forever. For a few weeks. Long enough to get the development environment right, establish the conventions, wire up CI/CD, and make sure the orchestration layer is solid before anyone builds on top of it. 

The teams that skipped that work aren't moving fast. They're managing debt. Every new model gets built carefully because nobody is sure what it might break. Every pipeline change requires manual testing because the automated checks were never put in place. Every new hire takes weeks to get productive because the knowledge lives in people, not in the system. 

A quick start that skips the foundation isn't free. It's a loan at a high interest rate. The payments start small and get larger every month. 

Getting the foundation right upfront doesn't mean months of invisible infrastructure work before anyone sees results. Done well, it takes weeks, not quarters. And what you get on the other side is a team that ships twice a week without being afraid of what they might break, data that business users trust, and a platform that gets easier to extend as it grows rather than harder. 

That's not slow. That's the fast path. 

Before you sign with anyone, there's a specific set of questions worth asking your SI or platform vendor. We covered them in detail here.

How Datacoves Compresses the Foundation Work 

Most teams face a choice at the start of a data platform project. Build the foundation properly and accept that it takes time. Or skip it and move fast now, knowing you'll pay for it later. 

Datacoves is built around the idea that you shouldn't have to make that trade-off. 

It's an enterprise data engineering platform that runs inside your private cloud and comes with the foundation pre-built. Managed dbt and Airflow, a VS Code development environment your engineers can open on day one, CI/CD pipelines that enforce quality before anything reaches production, and an architecture built on best practices that your team inherits rather than invents. 

The conventions, the guardrails, the deployment workflows, the secrets management, the testing framework. None of that gets figured out after the fact. It's already there. 

That's what compresses the timeline. Not shortcuts. Not skipping steps. The foundation work is done, and your team starts from a position that most organizations spend a year trying to reach on their own. 

The result is a team that ships consistently from early on, data that business users trust because quality is enforced by the system rather than by people remembering to check, and a platform that gets easier to extend as it grows. 

Guitar Center onboarded in days. Johnson & Johnson described it as a framework accelerator. Those outcomes aren't the result of moving fast and fixing problems later. They're the result of starting with a foundation that didn't need to be fixed. 

Snowflake is a great warehouse. The teams that get the most out of it aren't the ones who bought it and figured out the rest later. They're the ones who treated the platform layer as part of the project from the start. The tool doesn't build the platform. That part is still your decision to make. 
