
Setting up dbt with Snowflake takes four steps: install the dbt-snowflake adapter with pip, configure a Snowflake user with key pair authentication, set up profiles.yml, and verify the connection with dbt debug.
From there, add a few packages (dbt-coves, dbt_constraints, dbt_semantic_view), install SQLFluff and the right VS Code extensions, and you're ready to build.
The full setup is straightforward for one developer. It gets expensive across a team, which is where managed dbt platforms come in.
This guide walks through each step, the tooling that's worth adding, and when it makes sense to stop maintaining the setup yourself.
Before you can run dbt against Snowflake, you need three things on your machine and one thing in Snowflake:
On your machine: Python (the dbt-snowflake adapter no longer supports older versions; 3.11 or 3.12 is a good default), Git, and VS Code.
In Snowflake: a user configured with a dedicated role, database, warehouse, and key pair authentication. Avoid ACCOUNTADMIN for day-to-day dbt work.
That's the short list. The next sections walk through each piece.
Once Python, Git, and VS Code are installed, the only thing left to install locally is the dbt adapter for Snowflake.
Install dbt inside a virtual environment, not against your system Python. A venv keeps your dbt dependencies isolated from other Python projects and makes upgrades safe:
python -m venv .venv
source .venv/bin/activate # macOS/Linux
.venv\Scripts\activate # Windows
Activate the venv every time you work on the project. Tools like uv or pyenv are also worth looking at if you're managing multiple Python versions across projects.
Open a terminal and run:
pip install dbt-snowflake
This installs dbt-core and the Snowflake adapter together. The adapter version pins a compatible dbt-core, so in most cases you don't need to specify versions yourself.
If you need a specific version for a project that's pinned to an older release, install it explicitly:
pip install dbt-snowflake==<version number>
Confirm the install worked:
dbt --version
You should see both dbt-core and dbt-snowflake listed.
Before dbt can connect to Snowflake, you need a Snowflake user with the right permissions, a role for that user to assume, a database where dbt can build models, and a warehouse for dbt to use as compute. You also need an authentication method. As of late 2025, that means key pair authentication, not a password.
For a typical dbt setup, create a dedicated role, database, and warehouse rather than reusing existing ones. This keeps dbt's footprint isolated and easy to govern.
Run the following as a user with SECURITYADMIN privileges (or higher, but avoid ACCOUNTADMIN for day-to-day work):
-- Create a warehouse for dbt compute
create warehouse transforming
warehouse_size = 'xsmall'
auto_suspend = 60
auto_resume = true
initially_suspended = true;
-- Create a database where dbt will build models in development
create database analytics_dev;
-- Create a role for dbt developers
create role analyst;
-- Grant ownership of the dev database to the role
grant ownership on database analytics_dev to role analyst;
-- Grant warehouse usage to the role
grant usage on warehouse transforming to role analyst;
-- Grant the role to your user
grant role analyst to user your_username;
When dbt runs, it creates a schema for each developer inside analytics_dev and uses the transforming warehouse for compute. Production deployments typically use a separate role, database, and warehouse, governed through CI/CD rather than developer accounts.
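Before moving on, it's worth confirming the grants landed. Snowflake's SHOW commands make this a quick check (output will vary by account):

```sql
-- List everything granted to the dbt developer role
show grants to role analyst;

-- Confirm your user has been granted the role
show grants to user your_username;
```

If the analyst role is missing OWNERSHIP on analytics_dev or USAGE on transforming, dbt debug will fail later with a permissions error, so it's cheaper to catch it here.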
For a more comprehensive Snowflake permission model (read-only roles, environment-specific access, masking policies, RBAC at scale), see How to Configure Snowflake for dbt on the dbt blog. We'll also cover infrastructure-as-code options for managing this further down.
Key pair authentication is the correct default for connecting dbt to Snowflake. As of November 2025, Snowflake enforces MFA on username/password logins, which makes password authentication unworkable for any unattended dbt run.
Step 1. Generate a key pair on your machine.
# Generate an unencrypted private key
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
# Generate the matching public key
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
Windows users: install OpenSSL via Git for Windows (which bundles it).
For production or CI/CD environments, store the private key in a secrets manager rather than on developer machines.
Step 2. Register the public key with your Snowflake user.
In Snowflake, run:
alter user your_username set rsa_public_key='<paste the contents of rsa_key.pub here, without the BEGIN/END lines>';
Step 3. Reference the private key from profiles.yml.
dbt supports either a path to the private key file or the key contents inline. We'll set this up in the next section.
For SSO environments where browser-based authentication is acceptable for local development, externalbrowser is also supported, but it can't be used for unattended runs. For most teams, key pair auth is the consistent answer across local development, CI, and production.
With Snowflake configured, the next step is to point dbt at it. dbt reads connection details from a file called profiles.yml, which lives in your home directory at ~/.dbt/profiles.yml. Project-level Snowflake behavior (table types, query tags, warehouse overrides) lives in dbt_project.yml inside the project itself.
If you're starting from scratch, dbt init creates a new project and prompts you for connection details:
dbt init my_project
If you're cloning an existing project, run dbt init from inside the cloned repo to set up your profiles.yml entry without overwriting the project files.
The init flow asks for the database type, account identifier, user, authentication method, role, database, warehouse, schema, and threads. The result is a working profiles.yml entry that looks like this:
my_project:
target: dev
outputs:
dev:
type: snowflake
account: your_org-your_account
user: your_username
private_key_path: /Users/your_username/.snowflake/rsa_key.p8
role: analyst
database: analytics_dev
warehouse: transforming
schema: dbt_your_username
threads: 8
A few notes:
account uses the preferred <orgname>-<account_name> format. See Snowflake's account identifier documentation for how to look up your organization name and account name in Snowsight.
private_key_path points to wherever you saved the private key you generated. Use the absolute path; the ~/ shorthand isn't always reliable in profiles.yml.
schema is the developer's personal schema. The convention dbt_<username> prevents developers from stepping on each other.
threads controls how many models dbt builds in parallel. 8 is a reasonable starting point.
If you maintain a project that other developers will clone, add a profile_template.yml at the project root. It pre-fills the fixed values (account, role, database, warehouse) and only prompts each developer for what's truly user-specific (their username, schema, threads). This saves real time across a team.
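As a sketch, a profile_template.yml for the project above might look like the following. The fixed/prompts structure follows dbt's documented convention; verify the exact field names against the dbt docs for your version:

```yaml
fixed:
  type: snowflake
  account: your_org-your_account
  role: analyst
  database: analytics_dev
  warehouse: transforming
prompts:
  user:
    hint: "Your Snowflake username"
  schema:
    hint: "dbt_<your username>"
  threads:
    hint: "Models to build in parallel"
    type: int
    default: 8
```

With this file in the repo, dbt init fills in the shared values automatically and each developer only answers the three prompts.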
Before doing anything else, confirm dbt can connect to Snowflake:
dbt debug
If everything is configured correctly, you'll see All checks passed! at the bottom of the output. If you get an error, the most common causes are a wrong account identifier, a mis-registered public key, or a missing grant, such as USAGE on the warehouse or OWNERSHIP on the database. If you're stuck, the #db-snowflake channel on the dbt Community Slack is the fastest way to get unstuck.
dbt init gives you a working baseline, but a few profiles.yml settings are worth knowing about once you start running dbt regularly:
reuse_connections: true keeps Snowflake connections alive across queries, which speeds up runs noticeably and is especially helpful with SSO.
client_session_keep_alive: true prevents Snowflake from timing out long sessions during big builds.
query_tag sets a default tag on every query dbt issues. This makes it easy to filter dbt activity in QUERY_HISTORY (we'll cover model-level overrides in the next section).
connect_retries and connect_timeout are worth tuning if you hit transient connection failures.
Full reference: dbt-snowflake profile configuration.
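Applied to the dev target from earlier, those settings slot in alongside the connection fields. The values below are illustrative starting points, not universal defaults:

```yaml
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      # ...account, user, key, role, database, warehouse, schema as above...
      reuse_connections: true
      client_session_keep_alive: true
      query_tag: dbt_dev
      connect_retries: 2
      connect_timeout: 10
      threads: 8
```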
Where profiles.yml controls how dbt connects, dbt_project.yml controls how dbt builds against Snowflake. A few Snowflake-specific configs are worth knowing about:
Transient tables. Snowflake transient tables skip Fail-safe storage, which reduces cost. dbt creates transient tables by default. To make a folder of models permanent (for example, models that need Time Travel beyond one day or Fail-safe protection):
models:
my_project:
marts:
+transient: false
Query tags at the model level. Set a default in profiles.yml and override per model or folder in dbt_project.yml:
models:
my_project:
finance:
+query_tag: "finance_models"
Copy grants on rebuild. When dbt rebuilds a table, grants on the previous table are dropped by default. To preserve them:
models:
my_project:
+copy_grants: true
Warehouse override. Most models can run on a small warehouse, but a few heavy ones may need more compute. Override per model or folder rather than running everything on a large warehouse:
models:
my_project:
heavy_marts:
+snowflake_warehouse: "transforming_xl"
This also works for tests, which is useful when you want lightweight tests on a smaller warehouse than your model builds.
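For example, to run tests on a smaller warehouse than the models themselves, add a tests block to dbt_project.yml. The warehouse name transforming_xs is illustrative, and newer dbt versions also accept data_tests: as the block name:

```yaml
models:
  my_project:
    +snowflake_warehouse: "transforming"
tests:
  my_project:
    +snowflake_warehouse: "transforming_xs"
```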
The full list of Snowflake-specific configs lives in the dbt Snowflake configurations reference.
dbt is most useful when paired with the right packages and Python libraries. The list below isn't exhaustive, but each of these earns its place in a serious dbt-on-Snowflake project.
dbt-coves is an open-source CLI tool maintained by Datacoves. It automates the tedious parts of dbt development that nobody enjoys doing by hand: generating source definitions, staging models, property files, and Airflow DAGs from your warehouse metadata.
Install it with pip:
pip install dbt-coves
Most teams use it for staging model generation. Point it at a source schema and it produces clean staging models, source YAML, and the matching property files in seconds. For analytics engineers who model dozens of source tables, this saves hours per project.
dbt-coves also includes utilities for backing up Airbyte and Fivetran configurations, which is useful when you want your ingestion config to live in Git alongside your dbt models.
dbt_constraints is a Snowflake Labs package that turns your existing dbt tests into actual database constraints. If you've already added unique, not_null, and relationships tests, this package will generate matching primary key, unique key, foreign key, and not-null constraints on Snowflake automatically.
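As a reminder of what those tests look like, here is a typical model YAML (model and column names are illustrative, matching the examples used later in this guide) that dbt_constraints can convert into database constraints:

```yaml
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique      # becomes a primary/unique key constraint
          - not_null    # becomes a NOT NULL constraint
      - name: customer_id
        tests:
          - not_null
          - relationships:          # becomes a foreign key constraint
              to: ref('dim_customers')
              field: customer_id
```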
Add it to packages.yml:
packages:
- package: Snowflake-Labs/dbt_constraints
version: [">=1.0.0", "<2.0.0"]
Why bother, given that Snowflake doesn't enforce most constraints?
The answer is RELY. dbt_constraints creates constraints with RELY automatically when the underlying test passes, and NORELY when it fails. The optimizer can use RELY constraints for join elimination, which removes unnecessary tables from query plans.
dbt_semantic_view is a newer Snowflake Labs package that adds a semantic_view materialization to dbt. It lets you define and version-control Snowflake's native semantic views the same way you manage models.
Add it to packages.yml:
packages:
- package: Snowflake-Labs/dbt_semantic_view
version: [">=1.0.0", "<2.0.0"]
A semantic view model looks like this:
{{ config(materialized='semantic_view') }}
TABLES (
orders AS {{ ref('fct_orders') }},
customers AS {{ ref('dim_customers') }}
)
RELATIONSHIPS (
orders_to_customers AS orders (customer_id) REFERENCES customers (customer_id)
)
DIMENSIONS (
customers.region AS region,
orders.order_date AS order_date
)
METRICS (
orders.total_revenue AS SUM(orders.amount),
orders.order_count AS COUNT(orders.order_id)
)
Once materialized, the semantic view is a real Snowflake object. It can be consumed by Cortex Analyst, Snowflake Intelligence, and any tool that queries Snowflake. Because the definition lives in your dbt project, metric logic gets the same Git history, peer review, and CI/CD as your transformations.
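Once the view exists, it can be queried with Snowflake's SEMANTIC_VIEW syntax. The sketch below assumes the model file is named my_semantic_view; check Snowflake's semantic view documentation for the exact clause order supported in your account:

```sql
select *
from semantic_view(
    my_semantic_view                              -- hypothetical model/view name
    dimensions customers.region
    metrics orders.total_revenue, orders.order_count
);
```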
This matters more than it sounds. Most semantic layers either live outside dbt (drift inevitable) or get reinvented in every BI tool (drift guaranteed). Defining the semantic layer in dbt and materializing it natively in Snowflake closes that gap.
SQLFluff is the de facto SQL linter for dbt. It enforces formatting and style rules across your project so reviewers can focus on logic, not whether someone used trailing commas or capitalized SQL keywords.
Install it alongside dbt:
pip install sqlfluff sqlfluff-templater-dbt
The sqlfluff-templater-dbt plugin lets SQLFluff understand Jinja, refs, sources, and macros. Without it, the linter chokes on dbt syntax. Configure rules in a .sqlfluff file at the project root, and point the templater at your dbt project so it can resolve refs.
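A minimal .sqlfluff for a dbt-on-Snowflake project looks like this (the rule override at the end is just an example of enforcing a style choice; pick rules that match your team's conventions):

```ini
[sqlfluff]
templater = dbt
dialect = snowflake

[sqlfluff:templater:dbt]
project_dir = .

[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = lower
```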
Datacoves sponsors SQLFluff as part of its commitment to open-source dbt tooling.
dbt-checkpoint is a set of pre-commit hooks that validate dbt project quality before code is merged. It catches the things code review usually misses: a model without a description, a column that's documented in YAML but missing from the SQL, a source that's been added without tests.
Install it as part of your pre-commit setup:
pip install pre-commit
Then add the dbt-checkpoint hooks to .pre-commit-config.yaml:
repos:
- repo: https://github.com/dbt-checkpoint/dbt-checkpoint
rev: v2.0.7 # Verify the latest released version of dbt-checkpoint
hooks:
- id: check-model-has-description
- id: check-model-columns-have-desc
- id: check-model-has-tests
- id: check-source-has-freshness
- id: check-script-has-no-table-name
Run pre-commit install once and the hooks fire automatically on every commit.
The point isn't to enforce every possible rule. It's to keep technical debt from accumulating before it has a chance to compound. Datacoves maintains dbt-checkpoint as part of the broader dbt ecosystem.
For a broader look at testing strategy, see An Overview of Testing Options for dbt.
VS Code is the default IDE for dbt development. A few extensions turn it from "a nice editor" into a productive dbt workspace.
The official Snowflake extension brings the Snowsight experience into VS Code. You can browse databases, run worksheets, view query results, and upload or download files from Snowflake stages, all without leaving the editor.
For dbt developers, the most useful part is being able to run ad-hoc queries against your warehouse next to the model you're working on. No more flipping between the browser and your IDE every time you need to inspect a column or check a row count.
Power User for dbt (formerly called dbt Power User) is the most useful dbt extension. It adds the things dbt should arguably ship with itself:
Among them: click-through on ref() and source() calls to jump to the underlying file. If you only install one extension, install this one.
The SQLFluff VS Code extension wires the SQLFluff linter directly into the editor. Linting errors show up inline as you type, with hover descriptions that link to the SQLFluff docs.
This is the difference between linting being a chore developers run occasionally and linting being something they fix as they write. The former gets ignored. The latter keeps the codebase clean.
The extension reads from the same .sqlfluff config file that the CLI uses, so there's no duplicate setup.
A modern dbt-on-Snowflake AI workflow combines an in-IDE assistant (Power User for dbt, GitHub Copilot, Claude Code) with a Snowflake-native assistant (Snowflake Cortex CLI) and MCP servers that give the AI structured access to your dbt project and warehouse metadata.
AI has moved past being a novelty in dbt development. Used well, it accelerates the work that doesn't need a human (writing tests, generating documentation, drafting models, explaining errors) and gives developers more time for the work that does (modeling decisions, business logic, architecture).
A modern dbt-on-Snowflake workflow has a few good options.
Snowflake Cortex CLI (CoCo). Snowflake's command-line AI assistant runs against your Snowflake account and works like Claude Code or other terminal-based coding assistants. It's particularly useful for dbt because it can find tables and columns, inspect schemas, and generate SQL grounded in your actual warehouse, not a generic LLM guess.
Read more: Datacoves Expands Snowflake AI Data Cloud Support.
Claude Code, GitHub Copilot, OpenAI Codex CLI, Gemini CLI. Each of these works inside VS Code or the terminal. Claude Code and Codex CLI are particularly strong for multi-step refactors across a dbt project. Copilot is hard to beat for inline suggestions. The right choice depends on what your organization already pays for and what data your security team is comfortable sending to which provider.
MCP servers. Model Context Protocol servers let AI assistants interact with dbt projects, Snowflake, and other tools through a standardized interface. Snowflake and the broader community have shipped MCP servers. Pairing an MCP server with an AI assistant gives the model real awareness of warehouse metadata.
The thing to avoid is treating AI as a separate workflow. The point is to integrate it into the same VS Code environment where developers already work, with credentials and access already configured. Asking developers to copy-paste between a chat window and their IDE is friction the team will route around within a week.
This is one of the harder parts of running dbt on Snowflake at scale: keeping AI tooling consistent across developers, with the right credentials, the right MCP servers, and the right governance around what data the AI can see. Datacoves comes preconfigured with Claude Code, Snowflake Cortex CLI, GitHub Copilot, OpenAI Codex CLI, and Gemini CLI inside the in-browser VS Code environment, all working against your Snowflake account with no per-developer setup. For teams that want to standardize how AI shows up in dbt development, that's a meaningful head start.
dbt manages objects inside Snowflake (tables, views, tests, documentation). It does not manage Snowflake itself. Roles, users, grants, warehouses, masking policies, and resource monitors live outside dbt's scope and need a separate infrastructure-as-code tool.
Snowflake roles, users, grants, warehouses, masking policies, row access policies, network policies, resource monitors, and databases all live outside dbt's scope. Most teams handle this with whatever combination of click-ops, Snowsight, and SQL scripts has accumulated over the years. That works until it doesn't.
The point at which it stops working is usually predictable: an audit or security review asks who holds OWNERSHIP on production schemas, and the answer takes a week to assemble. The fix is to manage Snowflake infrastructure as code, the same way you manage dbt models. Define roles, grants, warehouses, and policies in version-controlled files. Apply changes through pull requests. Let CI/CD enforce that production matches what's in Git.
Terraform is the obvious starting point, but it's the wrong tool for most Snowflake teams. Terraform was built for managing infrastructure across many cloud providers, with a state file as its source of truth. For Snowflake specifically, this creates real problems, most of them rooted in keeping that state file in sync with a live, frequently changing account.
Snowcap is the Snowflake-native IaC tool Datacoves built and maintains as open source. It manages users, roles, grants, warehouses, masking policies, row access policies, and over 60 other Snowflake resource types using YAML or Python configuration. No state file. No DSL to learn. No abstraction layer between your config and Snowflake.
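As a purely illustrative sketch of the declarative style (the resource keys and structure below are assumptions for illustration, not Snowcap's verified schema; consult the Snowcap documentation for the real format), YAML-based Snowflake configuration in this category of tool tends to look like:

```yaml
# Hypothetical structure -- not Snowcap's documented schema
roles:
  analyst:
    granted_to_users: [your_username]
warehouses:
  transforming:
    size: XSMALL
    auto_suspend: 60
databases:
  analytics_dev:
    owner: analyst
```

The appeal is the same as dbt's: the file in Git is the source of truth, and a pull request is the change-management process.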
Snowcap is opinionated where opinion matters most.
If dbt is the workshop where you build data products, Snowcap is the power tools that keep the workshop itself in good order. The two work side by side: Snowcap manages who can see what and where compute lives, dbt manages how the data gets transformed.
For teams already running dbt with Snowflake, adding Snowcap is one of the highest-leverage moves available. It doesn't replace anything you have. It fills the gap that almost every dbt team has but pretends not to: governed, version-controlled, repeatable Snowflake infrastructure.
The setup in this guide works. Plenty of teams run it successfully. The honest question isn't whether you can do it yourself. It's whether you should, given what your team is trying to accomplish.
Here's the pattern most data teams follow:
At one or two developers, DIY is the right call. The setup is straightforward, the maintenance is low, and the team can iterate on conventions as they go. There's no good reason to add a managed platform at this stage.
At three to five developers, the cracks start to show. Onboarding a new developer takes a week instead of a day because everyone's local environment is slightly different. Python versions drift. Someone's profiles.yml has a passphrase from 2024 that nobody can find. CI/CD is held together by a YAML file one engineer maintains. It still works, but real time is being lost to platform maintenance.
At ten or more developers, DIY is expensive. Onboarding tax compounds. Upgrades require coordinating across the whole team. Secrets management becomes a real problem. Multiple dbt projects need governed dependencies. Production runs need an actual orchestrator, not a cron job. CI/CD pipelines need ownership. Someone is now spending a meaningful chunk of their week on platform work that has nothing to do with delivering data products.
For regulated industries, DIY runs into a different wall. Pharma, healthcare, financial services, and government workloads usually require private cloud deployment, strict identity controls, audit logging, and architectures that pass internal security review. SaaS dbt platforms are often a non-starter. DIY on Kubernetes is doable, but it pulls in months of platform engineering work before the data team writes a single model.
The decision isn't really between "DIY" and "managed." It's between who builds and maintains the platform layer. Either your team does it, or someone else does. If platform engineering is your team's competitive advantage, build it yourself. If your team's competitive advantage is delivering data products, the platform layer is overhead.
See also: dbt Deployment Options.
Managed dbt platforms (the category, not the marketing) handle the layer between dbt and the rest of your infrastructure: development environments, orchestration, CI/CD, governance, and secrets management.
Datacoves is the managed dbt platform we build, and the Snowflake integration is one of our most common deployments. Teams running dbt on Snowflake get an end-to-end environment in their own cloud: managed dbt, managed Airflow, in-browser VS Code, CI/CD, governance, and AI tooling, all preconfigured and connected to their Snowflake account.
For a side-by-side look at the trade-offs, see our comparison of dbt Core vs dbt Cloud.
dbt and Snowflake is one of the most productive combinations in modern data engineering. The tools fit together, the community is active, and the path from "first model" to "production analytics" is well-trodden. That doesn't mean the path is short.
The setup itself isn't the hard part. Installing the adapter, configuring authentication, writing profiles.yml, and running dbt debug is a one-afternoon exercise. The harder part is everything that comes after: keeping ten developers on the same Python version, governing who can do what in Snowflake, integrating AI without creating a mess, deciding which packages are worth their weight, and making the whole thing maintainable as the team grows.
The tooling in this guide handles most of it. dbt-coves removes the boilerplate. dbt_constraints turns your tests into actual database constraints. dbt_semantic_view brings the semantic layer into your dbt project. SQLFluff and dbt-checkpoint keep code quality from drifting. Power User for dbt makes daily development faster. Snowcap fills the gap dbt was never meant to fill.
Where it gets expensive is at scale. The setup that works for two developers doesn't scale to twenty without serious investment in the platform layer underneath. Either your team builds and maintains that layer, or you find a managed platform that does it for you. There's no third option that holds up over time.
If you're running dbt on Snowflake today and the setup is starting to feel heavier than it should, book a free architecture review. We'll discuss your environment, show you where Datacoves fits, and tell you honestly whether it makes sense for where you are.
A Data Operating Model is the set of decisions that define how a company delivers value from data. It covers ownership, team topology, workflows, standards, SLAs, governance, and the platform layer underneath all of it. The tools sit inside the operating model, not above it.
Most enterprises invest heavily in the tool layer and leave the operating model to emerge on its own during the build. That's the pattern behind nearly every frustrated data leader I talk to: the warehouse works, the transformation tool runs, the SI delivered on the statement of work, and the business still isn't getting what it expected. The absence of a defined operating model before the build started is the usual cause.
This article explains what a Data Operating Model is, what it includes, why foundational gaps compound instead of resolving themselves, and what to do if you're already mid-build and seeing the symptoms.
A Data Operating Model is the blueprint for how your organization turns data into business value. It defines who owns what, how work moves through the system, what standards apply, what "good" looks like, and what the platform underneath must enforce. It sits above the tools and above the architecture. The tools exist to serve the operating model, not the other way around.
Most executives have never been shown a Data Operating Model in concrete terms, so the concept stays abstract. It shouldn't. An operating model is a finite set of decisions that can be written down, agreed on, and enforced. The reason most enterprises don't have one isn't that it's hard to build. It's that it wasn’t scoped and nobody owned the outcome.

A mature Data Operating Model answers seven questions: who owns what, how teams are structured, how work moves through the system, what standards apply, what SLAs the business expects, how governance is enforced, and what the platform layer underneath has to automate.
1. Ownership. Who owns each data product? Who owns each source? Who owns the model that joins them? When something breaks, who is accountable? When something needs to change, whose approval is required? Ownership isn't an org chart. It's the map of accountability across every data asset in the business.
2. Team topology. How do data teams align to the business? Do you have a central data team that services everyone, embedded analytics engineers inside each domain, or a hybrid mesh model? Which decisions are centralized and which are distributed? Team topology is the hardest component to change later, which is why it should be the first decision made.
3. Workflows. How does a request become a data product? How does a code change get from a developer's laptop to production? How do business users request a new metric? How do downstream teams get access to upstream data? These workflows should be documented, repeatable, and the same across every team. When every team invents their own, you get the naming drift, the cross-team gaps, and the late-surfacing issues that frustrate the business.
4. Standards. Naming conventions. Layering semantics. Documentation expectations. Testing requirements. Code review rules. Branching strategy. These are the things that make a platform legible to a new engineer on day one instead of week six. Standards that live only in a Confluence page are not standards. They're suggestions.
5. SLAs. What does the business expect for data freshness? How fast should a new KPI ship? How fast should a new source onboard? What's the acceptable recovery time when a pipeline fails? Without explicit SLAs, every request becomes a negotiation, and every failure becomes a fire drill.
6. Governance. Who can approve a production deployment? Who signs off on a new data product? How is access granted and reviewed? How are sensitive fields handled? Governance isn't a separate project to start next quarter. It's a dimension of every decision the operating model makes.
7. Platform layer. The infrastructure underneath all the above. Git workflows. CI/CD. Orchestration. Development environments. Secrets management. Deployment conventions. This layer exists to enforce the operating model automatically, so the team doesn't have to remember to follow the rules.
Every enterprise already has answers to these seven questions. The difference between a mature operating model and an immature one is whether those answers were decided deliberately, written down, and enforced by the system, or whether they emerged ad hoc as the build progressed.
These terms get used interchangeably in executive conversations and they shouldn't be.
If the operating model is this important, why don't enterprises start there? Because the path to a data platform almost never runs through the operating model. It runs through a tool purchase, a vendor pitch, or a business crisis that demands a fast answer. The operating model is the thing that gets skipped because nobody in the room knows to ask for it, and the people selling the build aren't incentivized to slow things down.
Three patterns show up repeatedly. Each one produces the same outcome: a platform that works technically but doesn't deliver on the business intent.

The first pattern starts with the warehouse. Leadership identifies that the current data infrastructure is too slow, too expensive, or too old. Someone comes back from a Snowflake conference. A decision gets made to modernize. The procurement process kicks off. Within a few months, Snowflake is signed and an implementation partner is scoped.
The scope is the migration. Move data from the legacy system into Snowflake. Replicate the existing transformation logic. Hit the go-live date. That's what the statement of work says, and that's what gets delivered.
What isn't in the scope: the operating model. Nobody wrote into the contract that the team would emerge from the engagement with agreed-upon naming conventions, a defined ownership map, documented SLAs, or a governance framework. The warehouse goes live on schedule. The operating model questions are still open eighteen months later, because nobody owned them and nobody was paid to answer them. We've covered what gets missed when the implementation is scoped around the warehouse in more depth.
The second pattern hands the build to a systems integrator (SI) and watches them default to what they know. Every SI has a playbook. Some propose a custom metadata-driven framework. Some build their own Python-based orchestration layer. Some fall back on what they've shipped at ten other clients: heavy stored procedure logic, ELT patterns from a previous engagement, or a homegrown configuration system that mirrors whatever the team's senior architect built fifteen years ago.
The specific build doesn't matter as much as what the SI is focused on. They're focused on delivering the build. They're not focused on the business outcome the build is supposed to produce. We've seen this pattern documented in detail across enterprise implementations.
That distinction is the source of the problem.
When the engagement is scoped around the framework, the team's energy goes into framework decisions. How should the config tables be structured? What's the deployment mechanism? How do we handle environment promotion? Those are real questions, and they take real effort to answer. What doesn't get asked in the same meetings: Which business units are going to use this, and do they agree on naming? Who owns the data products once they're live? What SLAs is the business expecting? How will cross-team collaboration work when the second and third business units come online?
The framework ships. The first use cases deliver. The demo goes well. Then the symptoms start.
The internal team can read the framework but can't extend it without the SI. Framework changes require a new engagement. New capabilities that land in the open-source ecosystem (new Airflow features, new dbt patterns, new CI/CD tooling) don't land in the custom framework unless someone pays the SI to add them. The team is now operating on two clocks: the clock of the open-source world moving forward, and the clock of the custom build moving only when budget is available.
Meanwhile, the operating model gaps that existed before the SI arrived are still there. The SI wasn't asked to define naming conventions across business units, or to specify how cross-team collaboration should work, or to document who owns what. They were asked to build. So the build got built, the delivery team uses it, and the foundational questions remain unanswered. Now they're harder to address because the system is already in production and the vendor who understands it best is billing by the hour.
None of this is a critique of SIs as a category. Good SIs exist, and they can deliver real value inside a well-defined operating model. The problem is asking an SI to build a platform before the organization has decided what the platform is supposed to enforce. Under those conditions, the SI will default to what they know how to build. And what they know how to build will calcify around their way of working long after they've rolled off.
The third pattern doesn't require an SI. It happens when an internal data leader, often passionate and well-intentioned, drives the modernization themselves. They know the business problem. They've seen the pain. They've done their research on the modern data stack. They build the business case and get the budget.
What they often don't have is deep production experience running a data platform at enterprise scale. They know what outcomes good platforms produce. They haven't necessarily been inside one long enough to see the operating model decisions that make those outcomes possible.
So the modernization gets shaped around what they know: the warehouse, the transformation tool, maybe a basic orchestration layer. The harder operating model questions (ownership, team topology, SLAs, standards enforcement, cross-team workflows) don't get asked because nobody in the room has been burned by skipping them before. The team inherits a modern tool stack and an immature operating model, and the symptoms start showing up twelve to eighteen months in.
Buying Snowflake, buying dbt, and hiring an SI does not give you a Data Operating Model. The tools sit inside the operating model, not above it. Starting the build before the operating model is defined produces a platform that works technically but doesn't deliver on the business intent.
All three patterns share the same structural problem. The build starts before the operating model is defined, and the operating model is expected to emerge on its own during delivery. It doesn't. Operating models don't emerge. They get decided, or they get compensated for.
The teams that end up with mature operating models aren't the ones who got lucky with their tool choices or their SI. They're the ones who treated the operating model as an explicit deliverable, owned by leadership, scoped at the start of the project, and refined over time as the business learned. That work is not glamorous. It doesn't show up in a conference talk. It's the difference between a platform the business trusts and a platform the business works around.
The symptoms of a missing operating model are concrete, repeatable, and visible without technical expertise. If your platform has any of them, the operating model is doing less work than the team thinks it is.
The same concept gets six different names. CUSTOMER_ORDERS_MONTHLY_US, CUSTOMER_ORDERS_US_MONTHLY, CUSTOMER_ORDERS_MONTHLY_US_FINAL, CUSTOMER_ORDERS_US_MTHLY, and two more variations depending on which team built the model. Every variation is defensible in isolation. Together they make the platform illegible to a new engineer, impossible to govern, and fragile to extend. Naming is the most visible tell of a missing operating model because naming is decided by the operating model. When the operating model is absent, naming is decided by whoever gets there first.
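Naming enforcement doesn't have to be elaborate. The sketch below shows what a minimal CI check might look like in Python; the ENTITY_SUBJECT_GRAIN_REGION pattern and the segment lists are hypothetical stand-ins for whatever convention the team actually writes down, not a real standard:

```python
import re

# Hypothetical convention: ENTITY_SUBJECT_GRAIN_REGION,
# e.g. CUSTOMER_ORDERS_MONTHLY_US. Segment lists are illustrative.
GRAINS = {"DAILY", "WEEKLY", "MONTHLY"}
REGIONS = {"US", "EU", "APAC"}

def validate_name(table_name: str) -> list[str]:
    """Return a list of violations for one table name (empty list = compliant)."""
    errors = []
    if not re.fullmatch(r"[A-Z0-9_]+", table_name):
        errors.append(f"{table_name}: must be upper snake case")
    parts = table_name.split("_")
    if len(parts) < 4:
        errors.append(f"{table_name}: expected ENTITY_SUBJECT_GRAIN_REGION")
        return errors
    grain, region = parts[-2], parts[-1]
    if grain not in GRAINS:
        errors.append(f"{table_name}: unknown grain '{grain}'")
    if region not in REGIONS:
        errors.append(f"{table_name}: unknown region '{region}'")
    return errors

# One of the variants from the drift example fails the check:
violations = validate_name("CUSTOMER_ORDERS_US_MTHLY")
# In CI, exit non-zero when any violations exist so the merge is blocked.
```

Wired into the pipeline this way, the naming document becomes a gate rather than a suggestion: whoever gets there first still picks the name, but the system decides whether it ships.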
A team needs to answer an ad-hoc question using data that exists in the platform but wasn't shaped for their use case. They can't use the curated layer, so they go upstream and query raw tables directly. They build parallel logic. They duplicate transformations. The platform was supposed to be the source of truth. It's now one of three sources, and the business users don't know which one to trust.
This is a cross-team workflow problem. The operating model was supposed to define how downstream teams extend the platform, how they request new data products, and what process turns an ad-hoc query into a curated asset. It didn't, so each team invented its own answer.
Wide tables work reasonably well for operational reporting. A business analyst can find their way around a hundred-column table if they know what they're looking for. GenAI can't. Large language models answering business questions need narrow, purpose-built tables with clean column-level documentation, consistent naming, and traceable lineage. None of that comes from the warehouse. All of it comes from the operating model.
Enterprises that deferred documentation, skipped column-level descriptions, and let naming drift for three years are discovering that their AI initiative is surfacing every gap at once. The foundation they never built is now the thing blocking the board-mandated priority.
Requirements that should have been caught in the design phase land in UAT instead. The business user sees the data and says "that's not what I asked for." The team goes back to rework. The go-live date slips. The credibility of the delivery process erodes. Everyone agrees that requirements gathering needs to be better next time.
Requirements gathering isn't the problem. The problem is that the operating model never defined how business users participate in data product design, who validates the model before build starts, or what the acceptance criteria look like before UAT begins. Without that definition, the feedback loop closes at the wrong end of the project.
The executive summary lists "governance" as a Q2 initiative, then a Q3 initiative, then a Q1-next-year initiative. It keeps getting pushed because nobody owns it, nobody scoped it, and it doesn't have a clear business sponsor. Meanwhile, the platform is live. Data products are shipping. Access is being granted through manual tickets. Metadata is being maintained by whoever remembers to maintain it.
Governance deferred is governance that never happens. The operating model defines governance as a dimension of every decision, not a separate project. When it lives in the future, it stays there.
Nobody decided to skip documentation. It just wasn't on the project plan. Column-level descriptions don't exist because writing them wasn't part of anyone's definition of done. Lineage isn't captured because the framework doesn't surface it automatically and no one has time to maintain it manually. Business users asking "where does this number come from?" get an answer from whichever engineer built the model, if that engineer is still on the team.
Documentation that depends on discipline is documentation that degrades. The operating model is supposed to make documentation a byproduct of the build, not a deferred task.
The internal team can use the platform but can't extend it. Every new data source, every new transformation pattern, every new capability requires going back to the SI or the original architect. This dependency was never called out explicitly, but it's now the single biggest constraint on the team's ability to move. And it gets more expensive every quarter.
A new table or view is created. Someone is supposed to configure it so the right roles get access. Sometimes that step gets skipped. When it does, the object exists in the warehouse, but access doesn't propagate. Users can't see the new object. Someone spends a morning figuring out why. The fix is trivial. The pattern repeats next month with a different table.
The operating model was supposed to decide whether access assignment happens through automation or through a manual checklist. Either answer is defensible. No answer, and automation happens when someone remembers it and breaks when they don't.
DEV is refreshed on an ad-hoc cadence. PRE-PROD is "closer" to PROD but still out of sync. A change passes testing, hits production, and behaves differently because the data shape in production isn't what the team tested against. The business finds out. Trust erodes.
Environment parity is an operating model decision. Without one, every team defaults to "good enough for today" and the divergence between environments becomes structural.
Pipeline dependencies live in configuration files that developers update as they remember. If an upstream dependency is missing from the config, the data quality checks are the last safety net. When DQ coverage has gaps, the pipeline runs on incomplete data and nobody notices until a downstream user raises a ticket.
The operating model should have decided whether dependencies are inferred from the code or declared in configuration, and whether a missed declaration fails the build or silently succeeds. Without that decision, the default is "silently succeeds," which is the failure mode nobody wants and everybody ends up with.
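The "fails the build" option can be as small as a script that compares the dependencies a model actually references against the ones declared in configuration. A minimal sketch, assuming dbt-style ref() calls; the model SQL and the declared set are illustrative:

```python
import re

def extract_refs(sql: str) -> set[str]:
    """Find upstream models referenced via a dbt-style ref('...') call."""
    return set(re.findall(r"ref\(['\"](\w+)['\"]\)", sql))

def check_dependencies(sql: str, declared: set[str]) -> set[str]:
    """Return refs used in the SQL but missing from the declared config.
    A non-empty result should fail the build, not log a warning."""
    return extract_refs(sql) - declared

# Illustrative model: two upstream refs, only one declared in config.
model_sql = (
    "select * from {{ ref('stg_orders') }} "
    "join {{ ref('stg_customers') }} using (customer_id)"
)
declared = {"stg_orders"}  # stg_customers was forgotten

missing = check_dependencies(model_sql, declared)
if missing:
    print(f"Undeclared dependencies: {sorted(missing)}")
    # In CI, exit non-zero here so the pipeline stops instead of
    # running on incomplete data.
```

The point isn't this particular script; it's that the check runs on every commit, so a forgotten declaration becomes a failed build instead of a downstream ticket.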
If three or more of these are familiar, the root cause is a missing operating model. The symptoms are the system telling you so.
The single most useful frame for diagnosing a data platform is whether the controls that matter are enforced by the system or by people remembering to follow a process. This distinction cuts through every conversation about tools, frameworks, and team maturity. It's also the fastest way to predict how a platform will behave under growth, turnover, and pressure.
Most enterprise data platforms are checklist-controlled and presented as if they were platform-enforced. The gap between the two is where the symptoms in the previous section come from.

A checklist-controlled platform depends on people doing the right thing every time. Naming conventions live in a document that gets read once during onboarding. Access assignment requires someone to update a configuration table after creating a new object. Code quality depends on the reviewer having a good day. Dependencies get declared when the developer remembers to declare them. Documentation happens when there's time.
This works when the team is small, experienced, and under no time pressure. It degrades the moment any of those three conditions change. A new hire inherits the SOPs but not the instincts. A team lead rolls off and takes the context with them. A deadline compresses and the first thing that gets skipped is whatever depends on discipline rather than on the build itself.
Every failure in a checklist-controlled platform produces the same diagnosis: someone didn't follow the process. Which is accurate, and beside the point. The real diagnosis is that the platform was designed to require people to follow a process in a place where the system could have enforced it automatically.
A platform-enforced system makes the wrong action difficult, obvious, or impossible. Naming conventions are validated by CI/CD before a pull request can merge. Access is granted by the system based on rules, not by someone updating a table after the fact. Code quality is enforced by automated linting, testing, and review requirements that run on every commit. Dependencies are inferred from the code and validated against the actual pipeline. Documentation is required for a model to build, not requested after the fact.
The team doesn't have to remember the rules. The rules are the system.
This is the difference between a platform that scales and one that doesn't. A platform that depends on discipline gets more fragile as the team grows. A platform that enforces the rules gets stronger as the team grows, because every new engineer inherits the guardrails on day one without reading a document or asking anyone how things work.
The comparison that matters starts with who enforces the control. Checklist platforms rely on people. Enforced platforms rely on the system. People get tired, leave, and forget. Systems don't.
The most useful test of a data platform is this: if a person fails to follow the process, does the system stop them, or does the defect propagate? If the defect propagates, the control is a checklist. It may work today. It won't work at scale.
A data leader reading a platform architecture document usually can't tell whether the platform is checklist-controlled or enforced. The document will describe controls either way. The test is to read every control and ask: "if a person fails to follow this, does the system stop them, or does the defect propagate?"
Platforms that look mature in a demo and degrade in production are almost always checklist-controlled platforms. The demo is run by the people who wrote the checklist. The production team is everyone else.
The assumption behind most enterprise data platforms is that the foundational issues surfacing today are growing pains. They'll get fixed as the team matures, as the next phase of the build lands, as the governance workstream finally kicks off. This assumption is wrong.
Foundational gaps don't resolve themselves as the platform grows. They compound. Every new hire inherits the SOPs. Every new business unit multiplies the manual steps. The window to fix foundational issues cheaply closes quickly after go-live.
Four mechanisms make them worse.
Every new engineer who joins the team inherits whatever controls are in place on their start date. If the controls are platform-enforced, they inherit the guardrails automatically. The system makes the right action easy and the wrong action difficult. Onboarding becomes a matter of learning the business, not learning which of fourteen naming conventions applies to which business unit.
If the controls are checklist-based, new engineers inherit a document. Or a wiki. Or a Slack message from someone who remembers how things worked six months ago. The quality of their work becomes a function of how thorough their onboarding was and how carefully they read a Confluence page that may or may not be up to date.
The more engineers you onboard, the more variation accumulates. Naming drift gets worse with every new hire. Documentation gaps multiply. Cross-team conventions diverge. The team isn't doing anything wrong. They're just operating in a system that produces drift as its default behavior, and the drift is proportional to team size.
A platform serving one business unit can absorb a surprising amount of process debt. The people involved know each other. Context gets shared informally. Workarounds get remembered.
A platform serving four business units cannot. Every manual step that exists in the operating model (registering a new source, assigning access to a new object, declaring a pipeline dependency, updating a config table, reviewing a model against naming conventions) has to happen four times, by four different teams, under four different sets of pressures. The error rate doesn't stay constant. It grows.
The platform was built to handle the first business unit. The second business unit stressed it. The third exposed the gaps. By the fourth, the team is spending more time coordinating across business units than building for any of them. None of this was visible in the original design. It becomes visible only at the scale where the gaps matter.
Platforms built around a custom SI-delivered framework, a proprietary metadata layer, or a heavily customized orchestration stack produce a specific kind of debt: the debt of vendor knowledge. The people who built the system understand it. Nobody else does. As time passes, the system gets larger, the edges get more ornate, and the cost of explaining it to a new team gets higher.
The organization reaches a point where it can't extend the platform without the original builder. Every change requires a new engagement. Every new capability has a price tag attached. The open-source world is shipping new Airflow features, new dbt patterns, new CI/CD tooling, and new governance capabilities, none of which land inside the custom framework unless someone pays for the port. The gap between what's possible and what the team can actually use widens every quarter.
This is not a problem you can engineer your way out of once you're in it. The only way to solve it is to replace the custom layer with something the internal team can own, which is a second transformation program on top of the first.
Manual processes produce manual records. Agile board tickets. IT change logs. A spreadsheet that tracks who has access to what. Each of these is updated by a person, which means each of them can drift from the actual state of the system without anyone noticing.
In a small, well-disciplined team, the drift is minor. At enterprise scale, it's structural. Documented controls say one thing. The system is configured another way. Nobody notices until a compliance review surfaces the discrepancy, or until an incident makes it obvious that the access model on paper doesn't match the access model in production.
The teams that avoid this don't have better discipline. They have infrastructure as code, automated audit trails, and platform-enforced access management. The audit log is a byproduct of the system itself, not a ledger someone has to maintain.
The most dangerous assumption about foundational gaps is that they can be addressed later, once the delivery pressure eases. Delivery pressure never eases. The backlog grows. The business adds new use cases. The board adds an AI mandate. The team that was going to refactor the foundation in Q3 is now fighting fires through Q4.
Meanwhile, every new data product built on top of the existing foundation inherits the same gaps. Refactoring gets more expensive every month, not less. The window to fix foundational issues cheaply closes quickly after go-live. After that, every fix is a migration, and every migration competes with the delivery work the business is asking for.
The teams that treat operating model gaps as technical debt to be addressed later are making a bet about time that almost never pays off. The teams that treat operating model gaps as blockers to be addressed now are the ones that come out of the next three years with a platform the business trusts.
Every CEO has a GenAI mandate. Every board is asking about it. And yet a July 2025 MIT NANDA study found that 95% of enterprise GenAI pilots delivered no measurable P&L impact, despite $30–40 billion in enterprise spending. The default assumption behind those investments was that the data foundation was ready. It almost never is.
GenAI is the forcing function that makes operating model gaps impossible to hide. Wide tables, missing column descriptions, undocumented lineage, and manual access management all break AI workloads before they break human users.
A business analyst can work with a hundred-column table. They know what they're looking for, they skip the columns that don't matter, and they ignore the fields with unclear definitions. A large language model can't. When an LLM is given a wide table with inconsistent naming and missing column descriptions, it hallucinates. It picks the column that sounds right. It joins on a field that looks like a key and isn't. The output is confident and wrong.
The fix is narrow, purpose-built tables with clean semantics. Column names that describe what they contain. Column descriptions that explain business meaning. Consistent naming across related tables. Clear primary and foreign key relationships. These aren't data engineering niceties. They're the minimum viable inputs for AI that produces trustworthy answers.
Enterprises that spent three years building wide, denormalized operational reporting tables are now discovering that those tables can't be pointed at GenAI directly. They need a second modeling layer, often called a semantic layer, built for AI consumption. That layer takes real work to build. It's a project nobody scoped, running parallel to the existing delivery pressure.
For years, column descriptions were a nice-to-have. Data catalogs had them when a team made the effort. Documentation quality varied by business unit, by team lead, by quarter. The business mostly worked around the gaps.
GenAI changes that math. An LLM answering a business question needs to know what every column means. If the column descriptions are missing, stale, or wrong, the model fills in the gaps with plausible-sounding guesses. The answers come back polished and authoritative. The errors are invisible until a business user acts on a wrong number.
The operating model was supposed to decide that column descriptions are a requirement, not an afterthought. Most operating models didn't. So now the team is writing three years of back-documentation under board pressure, on top of the existing delivery work, for data products that have been live for months.
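That back-documentation effort at least doesn't have to be tracked by hand. Below is a sketch of an automated gap audit, assuming a dict shaped like the nodes section of dbt's manifest.json artifact (heavily simplified here; the model and columns are illustrative):

```python
def undocumented_columns(manifest: dict) -> list[str]:
    """List 'model.column' pairs whose descriptions are missing or blank."""
    gaps = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        for col, meta in node.get("columns", {}).items():
            if not meta.get("description", "").strip():
                gaps.append(f"{node['name']}.{col}")
    return gaps

# Minimal illustrative manifest fragment, not a full dbt artifact.
manifest = {
    "nodes": {
        "model.analytics.customer_orders": {
            "resource_type": "model",
            "name": "customer_orders",
            "columns": {
                "order_id": {"description": "Unique order identifier."},
                "region": {"description": ""},  # the gap an LLM will guess at
            },
        }
    }
}

print(undocumented_columns(manifest))  # → ['customer_orders.region']
```

Run as a CI gate that fails when the list is non-empty, a check like this is what turns "column descriptions are required" from a slide into a property of the build.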
When a business user asks an LLM "why is our Q3 revenue in the Northeast region down?", the LLM's answer is only as trustworthy as the lineage of the data it's querying. Where did the number come from? What source fed it? What transformations were applied? Which version of the transformation logic was in effect when the number was computed?
Platforms without end-to-end lineage can't answer those questions. The business user doesn't know what to trust. The data team can't validate the AI's output. The GenAI initiative produces answers that are confidently wrong, fails an executive review, and gets shelved.
Lineage is an operating model decision. Platforms that made the decision to capture lineage automatically as part of the build have it. Platforms that deferred lineage to a future governance project don't. And the second category is scrambling.
Gartner predicts that through 2026, organizations will abandon 60% of AI projects unsupported by AI-ready data. In addition, 63% of organizations either don't have or are unsure about having the right data management practices for AI. This is not a theory; it's already arriving on board agendas.
Operational reporting has predictable access patterns. A business analyst queries the tables they've been granted access to. A dashboard uses a service account with a defined permission scope. Everyone knows what's authorized and what isn't.
GenAI workloads don't behave that way. An LLM with access to "the sales data" may try to answer a question by joining across tables that sit in different access tiers. Natural language queries don't respect the access boundaries that were designed for structured SQL. Platforms with manual access assignment and checklist-controlled permissions produce one of two outcomes: AI that can't answer the question because it can't access the data, or AI that answers the question by accessing data it shouldn't have seen.
Both outcomes are failures. The fix is access management that's granular, automated, and enforced by the system. The operating model was supposed to define that. If it didn't, the GenAI initiative is about to expose exactly which data is governed and which data is governed by accident.
The operating model problems that felt tolerable in 2023 are intolerable in 2026. The board isn't giving the data team three years to refactor. They're asking for GenAI pilots in six months and production AI in twelve.
Teams with a defined operating model and a platform that enforces it are shipping those pilots already. They're not scrambling to back-fill documentation, rebuild wide tables into semantic layers, or retrofit access management. The work was done during the build, because the operating model made it part of the build.
Teams without that foundation are rediscovering every gap under deadline pressure. The AI initiative is failing because the foundation underneath it was never ready, and GenAI is the first workload that refuses to work around the foundation's problems.
If your data platform has the symptoms described earlier in this article, your GenAI initiative will surface every one of them. On a timeline the business is about to compress.
Most executives reading this article are not at the start of a data platform project. They're twelve, eighteen, twenty-four months in. The warehouse is live. The framework is in production. The first business unit is using it. The symptoms are real, and the question isn't whether the operating model should have been defined earlier. It's what to do now.
The answer is not to rip everything out. It's also not to accept the current trajectory and hope the next phase of the build compensates for the gaps in the current one. There's a middle path, and it starts with changing what the team is working on, not what it's working with.
Decisions first. Build second. Even mid-project.
The operating model is a finite set of decisions. A working session with the right people in the room can get most of the way through the list in a week. What matters is that the decisions get made deliberately and written down, not that they get made perfectly on the first try.
The decisions that matter most, in order of impact:
Naming conventions. Pick them. Write them down. Validate them automatically in CI. Every future asset conforms. Existing assets get renamed on a defined schedule.
Ownership map. Every data product has a named owner. Every source has a named owner. Every shared model has a named owner. If ownership is unclear, that's the first decision to make, not the last.
Layering semantics. What is raw data? What is a cleaned source? What is a business entity? What is a data product? Four layers, defined crisply, consistent across business units. Not six layers with three teams using them differently.
Access and environment parity. How is access granted? How is it reviewed? What's the refresh cadence for lower environments? Are DEV and PRE-PROD in sync with PROD, and if not, is that a known and accepted limitation or a problem nobody has prioritized?
SLAs. What does the business expect? For a new KPI. For a source onboarding. For a production incident. These get documented. Trade-offs get discussed explicitly instead of assumed.
Cross-team workflows. When the second and third business units onboard, how do they request data products from the central team? How do they extend models the central team owns? How do they avoid duplicating logic that already exists? This is the workflow that scales the platform beyond its first success.
Governance. Not as a future project. As a dimension of every decision already on this list. Ownership, access, naming, and lineage are all governance. If "governance" is still on the roadmap as a separate workstream, it's already too late.
The output of this work is a document. Short, explicit, and owned by a named executive. Not a deck. Not a wiki page. A written operating model that the team can point to when decisions come up, and that the platform can enforce.
The team's energy should go into operating model decisions, not rebuilding Git workflows, CI/CD, and orchestration from scratch.
If the operating model is a finite set of decisions, the infrastructure underneath it is the larger ongoing cost. Git workflows. CI/CD pipelines. Development environments. Secrets management. Orchestration. Deployment standards. Testing frameworks. Every team that builds a serious data platform eventually must build or buy all of it.
Teams that try to build the operating model and the infrastructure at the same time, with the same people, end up doing neither well. The operating model decisions get rushed because infrastructure is urgent. The infrastructure gets built without operating model clarity because decisions haven't been made yet. Both suffer.
The teams that succeed separate the two. The operating model is their work. The infrastructure underneath it is either delegated to a platform that's already built or scoped as a distinct workstream with its own ownership. When the team's meeting time is spent on operating model decisions instead of CI/CD configuration, the operating model gets defined faster, and the infrastructure stays consistent with it.
The hardest part of acting on a missing operating model is knowing where the gaps are. The executive asking "Is our operating model mature?" is usually not close enough to the platform to answer the question. The people close enough to answer are often incentivized to say everything is under control.
A small set of diagnostic questions surfaces where the operating model is doing work and where it isn't. Answering them honestly takes an hour. The pattern of answers tells you where to focus first.
On enforcement. Which of your data platform controls are enforced by the system, and which depend on people following a process? If a team member fails to follow the process, does the system stop them, or does the defect reach production?
On ownership. For every data product in your platform, can you name the owner in under thirty seconds? If not, how many orphans are there, and who inherits them when something breaks?
On naming and layering. Can a new engineer look at a table name and know what layer it belongs to, which business unit owns it, and what it contains? If not, how much context do they have to ask for before they can do their job?
On vendor dependency. If the SI or original architect of your platform disengaged tomorrow, could your internal team extend the framework? If not, how much of your roadmap depends on their continued engagement, and what's the cost?
On governance. Is governance a live dimension of every decision, or is it a future project on a slide deck? If it's a future project, how long has it been there?
On GenAI readiness. Could your current platform support a GenAI product that a business user would trust with a strategic decision? If not, what specifically is missing, and how long would it take to build?
On the time window. If you did nothing to change the current trajectory, what does the platform look like in twelve months? If the answer is "worse than today," the operating model work isn't optional.
The operating model is the set of decisions. The platform layer is the system that makes those decisions automatic. Separating the two is how mature data organizations move fast without degrading quality as they scale.
Datacoves exists because most enterprise data teams are spending their time on the wrong layer. They're rebuilding Git workflows, configuring CI/CD, standing up orchestration, wiring secrets management, and writing deployment conventions from scratch, on top of running the business. That work is necessary. It's also not differentiated. Every enterprise data team needs the same underlying platform capabilities, and every team that builds them in-house takes six to twelve months to get there, plus ongoing maintenance that never ends.
Datacoves delivers those capabilities preconfigured, inside the customer's private cloud, running on open-source tools the internal team can own. The operating model decisions still belong to the organization. The infrastructure underneath them is already built.
Git workflows with branching conventions, pull request requirements, and automated validation on every commit. Naming conventions, testing requirements, and documentation expectations get enforced before code merges. A missed convention doesn't reach production because the system doesn't let it.
CI/CD pipelines that run dbt tests, SQL linting, governance checks, and deployment validation automatically. Quality becomes a property of the pipeline itself, regardless of how attentive the reviewer is that morning.
Managed Airflow for orchestration. Pipeline dependencies, retries, failure alerts, and scheduling work consistently across every team. My Airflow for developer testing, Teams Airflow for production. Engineers don't rebuild orchestration conventions for each new project.
In-browser VS Code environments that come up preconfigured with dbt, Python, SQLFluff, Git integration, and every tool the team needs. A new engineer opens their environment on day one and starts writing code. Onboarding time drops from weeks to hours.
Secrets management integrated with the customer's existing vault or AWS Secrets Manager. Credentials never live in code. Access is controlled by the system itself.
Deployment standards that promote code from development through testing to production on the same workflow every time. No manual deployment steps. No scripts that only one person knows how to run.
Governance enforcement at commit time. dbt-checkpoint catches quality issues before they reach the pipeline. SQLFluff keeps SQL consistent. Naming conventions validate in CI. The team doesn't remember the rules because the system enforces them.
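As a concrete sketch, commit-time enforcement like this is usually wired up with pre-commit. The hook ids and revision pins below are illustrative; check them against the dbt-checkpoint and SQLFluff repositories before using them:

```yaml
# .pre-commit-config.yaml — illustrative sketch; verify hook ids and revs
# against the dbt-checkpoint and SQLFluff repositories before use.
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v2.0.6
    hooks:
      - id: check-model-has-description   # block commits with undocumented models
      - id: check-model-has-tests         # block commits with untested models
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 3.2.0
    hooks:
      - id: sqlfluff-lint                 # keep SQL style consistent across the team
```

With hooks like these installed, a model without a description or a test never makes it into a commit, let alone a pull request.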
Every control listed above is a system-enforced version of a checklist most enterprise platforms maintain manually. The difference in outcomes is structural, not incremental. A platform that enforces these controls automatically produces consistent quality at any team size. A platform that depends on discipline degrades as the team grows.
Datacoves is built around the assumption that the operating model is the customer's work, and the infrastructure that enforces the operating model should be the platform's work. That separation is what lets the customer's team spend its time on decisions that differentiate the business, not on infrastructure that every data team needs and no data team should have to build.
For a team already running on Snowflake with a custom framework or an SI-built platform, Datacoves is the alternative to a second transformation program. Instead of rebuilding the infrastructure layer internally or paying the SI to port new capabilities, the team moves to a platform that already has them. The operating model work the team needs to do anyway becomes the focus. The infrastructure underneath it is no longer the team's ongoing cost.
The customers who've made this move describe the outcome the same way: the engineering team stopped maintaining plumbing and started shipping data products. Guitar Center onboarded in days. Johnson and Johnson described it as a framework accelerator. Those aren't luck. They're the result of a platform layer that enforces the operating model by design.
If the symptoms earlier in this article match what you're seeing, the next step is a conversation about where the gaps are and what the platform layer can take off your team's plate. Book a free architecture review. The review surfaces the operating model gaps driving the symptoms the business is already complaining about, and it's the fastest way to see whether the platform layer can shorten the path to the outcomes you expected when you started the build.
A Data Operating Model is the work most enterprises skip because nobody told them it was the work. The tool purchase felt like progress. The SI engagement felt like progress. The first use cases shipping felt like progress. By the time the symptoms surfaced, the decisions that would have prevented them had been deferred long enough to become expensive.
The executives who get this right aren't smarter than the ones who don't. They're just earlier. They define the operating model before the build starts, or they stop the build long enough to define it once they realize it was never decided. The teams that do that work once ship data products for years afterward. The teams that don't spend those same years compensating for decisions that were never made.
If the symptoms in this article match what you're seeing in your own platform, the message is simple. The tools aren't failing you. The operating model underneath them is, and it will keep failing until somebody decides to define it. That work is smaller than it looks, it's faster to do than to defer, and it's the only path to the outcomes the business was expecting when the project started.
Your team has spent eighteen months proving they can build. The next eighteen months are going to be about whether the business trusts what got built. That outcome is decided at the operating model layer, not at the tool layer. The sooner leadership treats it that way, the sooner the symptoms stop.

Datacoves enables Snowflake customers to deploy secure, end-to-end data engineering environments with dbt, Airflow, and modern DevOps best practices
Datacoves is expanding its integration with the Snowflake AI Data Cloud, giving Snowflake customers a secure, end-to-end data engineering environment with dbt, Airflow, and modern DevOps best practices, all running inside their own cloud.
This means Snowflake teams get a consistent foundation for development, orchestration, testing, CI/CD, and observability without moving data outside their environment or introducing new security risks.
“Snowflake is the analytical backbone for many of the world’s most data-driven organizations. Datacoves gives those teams a secure and opinionated platform to run modern data engineering practices on top of Snowflake, without forcing them into rigid SaaS tools or DIY infrastructure.”
— Noel Gomez, Co-Founder of Datacoves
Organizations using Snowflake can standardize how teams develop and operate analytics workflows using dbt, Airflow, Python, and Git-based workflows while maintaining full control over identity, access, logging, and infrastructure.
Datacoves is commonly used by large enterprises running Snowflake in regulated and complex environments, including life sciences, consumer goods, and financial services. These organizations require private deployment, operational flexibility, and strong engineering foundations.
Datacoves already supports the Snowflake Cortex CLI (CoCo) inside the in-browser VS Code environment with zero setup required for end users. Snowflake credentials are automatically configured through the existing Snowflake extension, so developers can start using CoCo immediately.
The Cortex CLI works like Claude Code but runs on Snowflake’s infrastructure. Within Datacoves, developers can use it to:
• Query and explore Snowflake data directly from the terminal
• Generate Python scripts that interact with external APIs and services
• Find tables, columns, and schema details across Snowflake databases
• Accelerate development for both Snowflake-specific and general-purpose tasks
Because Datacoves provides a standardized, preconfigured development environment, there’s no installation guesswork. CoCo picks up existing credentials and connections automatically.
Datacoves also maintains Snowcap, an open-source, Snowflake-native infrastructure-as-code tool built from deep experience managing Snowflake at scale.
Snowcap uses YAML or Python configuration, requires no state file, and supports over 60 Snowflake resource types. It includes opinionated accelerators for some of the most complex areas of Snowflake administration:
• Role-Based Access Control (RBAC) for managing permissions at scale across teams, projects, and environments
• Tag-Based Masking Policies for applying dynamic data masking consistently across sensitive columns
• Row Access Policies for controlling row-level security with auditable, version-controlled configurations
These are areas where manual Snowflake administration breaks down fast, especially as teams and data grow. Snowcap brings software engineering discipline to Snowflake governance with CI/CD integration through GitHub Actions.
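For context on what the tag-based masking accelerator automates, the underlying Snowflake DDL looks roughly like this; the policy, tag, and role names are examples, not Snowcap's schema:

```sql
-- Illustrative Snowflake DDL; names (pii_mask, pii, PII_READER) are examples.
-- Define a masking policy: show the value only to an allowed role.
CREATE MASKING POLICY pii_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '***MASKED***' END;

-- Bind the policy to a tag; any column carrying the tag is masked automatically.
ALTER TAG pii SET MASKING POLICY pii_mask;
```

Writing a handful of these by hand is manageable. Keeping dozens consistent across environments is where version-controlled configuration earns its keep.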
Snowflake handles storage and compute. Datacoves provides the engineering layer that sits on top: managed dbt, managed Airflow, CI/CD, governance, and best practices. Together, they give enterprise teams a complete, production-ready data engineering environment deployed in weeks.
Teams eliminate fragmented environments, inconsistent workflows, and manual platform maintenance. The result is faster onboarding, clearer ownership, and improved visibility across ingestion, transformation, orchestration, and deployment.
To learn more about how Datacoves supports Snowflake teams, book a free architecture review or visit datacoves.com/snowflake.
About Datacoves
Datacoves is an end-to-end data engineering platform that helps organizations deliver secure, high-quality data products with speed and confidence. Deployed inside a customer’s own cloud and enterprise network, Datacoves provides a unified environment for development, orchestration, testing, CI/CD, and observability. It delivers a managed platform for dbt, Airflow, and Python without vendor lock-in.
Snowflake is one of the best data warehouses available. But buying it doesn't give you a data platform. A working platform also requires an engineering environment where your team can develop consistently, orchestration to run and monitor pipelines, CI/CD to enforce quality before anything reaches production, and ways of working that make the whole thing maintainable as your team grows. Most Snowflake implementations deliver the warehouse. The platform layer around it, and the practices underneath it, are usually left for your team to figure out after the SI rolls off. That gap is where most implementations quietly fail.
Buying Snowflake gives you a warehouse. A working data platform requires an engineering environment, orchestration, CI/CD, and ways of working that don't come with the warehouse contract.
A data warehouse stores and processes data. That's what it was designed to do, and Snowflake does it exceptionally well.
A data platform does something different. It's the environment where your team develops, tests, deploys, and monitors data products. It includes the tools, the conventions, and the ways of working that determine whether your data is trustworthy, usable, and maintainable at scale.
The distinction matters because most implementations are scoped around the warehouse. The platform layer gets treated as something that will sort itself out later. It rarely does.
Think about it in two layers.
The first is what your users experience: whether they trust the data, whether they can find and understand it, and whether business and technical teams can communicate around it. This includes trustworthiness, usability, and collaboration.
The second is what makes those outcomes possible at the platform level: whether data products can be reused without rebuilding from scratch, whether the system is maintainable when people leave or the team grows, and whether pipelines are reliable enough that failures get caught early instead of surfacing in a meeting. These include reusability, maintainability, and reliability.
Most Snowflake implementations deliver storage and compute. The six outcomes above are what your business expected the platform to produce. They require deliberate work that sits outside the warehouse contract.

Snowflake is excellent at what it does. Fast queries, elastic scaling, clean separation of storage and compute, a strong security model. If your previous warehouse was on-prem or running on aging infrastructure, the difference is real and immediate.
The problem isn't Snowflake. The expectation that the warehouse is the platform is.
Snowflake handles storage, compute, and access control. It doesn't give your team a development environment. It doesn't orchestrate your pipelines or tell you when one failed and why. It doesn't enforce naming conventions, testing standards, or deployment rules. It doesn't document your data models or make them understandable to a business analyst who didn't build them. It doesn't define how your team reviews code, manages branches, or promotes changes from development to production.
Those things aren't gaps in Snowflake's product. They were never Snowflake's job.
But when leaders evaluate a warehouse and sign a contract, the scope of what they're buying rarely gets articulated clearly. The demos show fast queries and a clean UI. The pitch covers performance benchmarks and cost savings versus the legacy system. Nobody walks through the engineering environment your team will need to build on top of it, because that's not what the vendor is selling.
So teams buy a best-in-class warehouse and then spend the next six months discovering everything else they need. Some figure it out. Some don't. And most take a long time to get there.
There are three common paths to a Snowflake implementation. Each one has real strengths. Each one has a predictable blind spot that leads to the same outcome: a warehouse that works, but fails to deliver the expected results.

Snowflake's marketing is good. That's not a criticism, it's an observation. The positioning is clear, the case studies are compelling, and the product genuinely delivers on the core promise.
What the marketing doesn't cover is everything that sits around the warehouse. That's not Snowflake's job. Their job is to sell Snowflake. The implicit message, though, is that the hard problem is the warehouse. Once that's solved, everything else follows.
It doesn't. Leaders who build their implementation strategy around the vendor pitch tend to underscope the project from the start. The warehouse gets stood up on time and on budget. The data engineering environment, the orchestration layer, the governance foundation, those get deferred. Sometimes indefinitely.
Every organization has at least one person who comes back from a Snowflake conference ready to modernize everything. That enthusiasm is valuable. It's also frequently mis-channeled.
Internal champions know the business problem well. They've seen the pain. What they often don't have is deep experience building and operating a production data platform from scratch. They know what good outcomes look like. They haven't necessarily seen what a well-built foundation looks like underneath those outcomes.
So the implementation gets shaped around what they know: the warehouse, the transformation tool, maybe a basic orchestration setup. The harder questions around developer environments, CI/CD, testing standards, secrets management, and deployment conventions don't get asked because nobody in the room has been burned by skipping them before.
A migration is not a platform implementation. The SI's job is to get your data into Snowflake. Whether the environment your team inherits is maintainable and built on sound engineering practices is usually outside the engagement scope.
System integrators are good at migrations. Moving data from point A to point B, replicating existing logic in a new tool, hitting a go-live date. That's what most of them are scoped and incentivized to deliver.
It's not that SIs cut corners. It's that "build a production-grade data engineering platform with sustainable ways of working" wasn't in the statement of work.
What gets handed off is a warehouse with some tables, some transformation logic, and documentation that will be out of date within a month. The team that inherits it then spends the next year figuring out how to operate it at scale.
If you're evaluating implementation partners, here's what to look for before you sign.
When the implementation is scoped around the warehouse and the migration, a predictable set of things gets deferred. Not because anyone decided they didn't matter, but because they weren't on the project plan.
Here's what that looks like in practice six to twelve months later.
Snowflake costs start climbing. Without well-structured data models, query optimization standards, and sensible clustering strategies, warehouses burn credits fast. Teams that skipped the engineering foundation often spend the first year optimizing for cost rather than delivering new capabilities. The savings from migrating off the legacy system quietly get absorbed by an inefficient Snowflake setup.
Business users don't trust the data. When there are no testing standards, no documentation conventions, and no consistent naming across models, analysts spend more time validating numbers than using them. The platform gets a reputation for being unreliable. People go back to Excel because nobody built the layer that makes data understandable and trustworthy.
The team can't move fast. Without CI/CD pipelines, code reviews, and deployment guardrails, every change is a risk. Engineers slow down because they're afraid of breaking something. Onboarding a new team member takes weeks because the knowledge lives in people's heads, not in the system.
Pipelines break in ways nobody sees coming. Without orchestration that handles dependencies, retries, and failure alerts, pipeline failures surface downstream. A business user notices the numbers are wrong before the data team does. That erodes trust fast and is hard to rebuild.
The foundation debt compounds. Every week that passes without fixing the underlying structure makes it harder to fix. New models get built on top of a shaky base. Refactoring becomes expensive. The team that was supposed to be delivering new data products spends its time maintaining what already exists.
This is the real cost of the quick win approach. Six months of fast progress followed by years of slow, careful, expensive work to undo the shortcuts.
We've documented what that looks like in practice here.
Most implementation conversations focus on the tool stack. Which warehouse, which transformation framework, which orchestrator. Those are real decisions and they matter.
But the teams that deliver reliable data products consistently aren't just using the right tools. They're using them the same way across every engineer on the team.
That's the ways of working problem. And it's the part nobody puts in the project plan.
A team with Snowflake and dbt but no agreed branching strategy, no code review process, no testing standards, and no deployment conventions is still fragile. One engineer builds models one way. Another builds them differently. A third inherits both and must figure out which approach is "correct" before they can extend anything. The system never enforced a consistent approach.
The same applies to orchestration. Airflow is powerful. An Airflow environment where every engineer writes DAGs differently, secrets are managed inconsistently, and there's no standard for how pipeline failures get handled is not an asset. It's a maintenance problem waiting to get worse.
Good data engineering is a thought-out combination of tools and conventions that work together. The conventions are what make the tools scale beyond the person who set them up.
This is why the two-layer framework matters in practice. Trustworthiness, usability, and collaboration aren't outcomes you get from buying the right tools. They're outcomes you get when the platform layer underneath them (reusability, maintainability, and reliability) is built deliberately. With both the right tooling and the right ways of working enforced by the system itself, not by people remembering to follow a document.
The teams that figure this out usually do it the hard way. They run into the problems first, then back into the conventions that would have prevented them. That process can take years and a lot of frustration. Getting the ways of working right from the start compresses that timeline significantly.

The teams that move fastest twelve months in are almost always the ones who slowed down at the start.
The most common objection to investing in the foundation is time. Leaders have stakeholders who want results. Boards want dashboards. The business wants answers. Spending eight weeks building an engineering environment and establishing conventions feels like the opposite of moving fast.
That instinct is understandable. It's also wrong.
The teams that move fastest twelve months in are almost always the ones who slowed down at the start. Not forever. For a few weeks. Long enough to get the development environment right, establish the conventions, wire up CI/CD, and make sure the orchestration layer is solid before anyone builds on top of it.
The teams that skipped that work aren't moving fast. They're managing debt. Every new model gets built carefully because nobody is sure what it might break. Every pipeline change requires manual testing because the automated checks were never put in place. Every new hire takes weeks to get productive because the knowledge lives in people, not in the system.
A quick start that skips the foundation isn't free. It's a loan at a high interest rate. The payments start small and get larger every month.
The same logic applies here. A quick start that skips the foundation isn't free. It's a loan at a high interest rate. The payments start small and get larger every month.
Getting the foundation right upfront doesn't mean months of invisible infrastructure work before anyone sees results. Done well, it takes weeks, not quarters. And what you get on the other side is a team that ships twice a week without being afraid of what they might break, data that business users trust, and a platform that gets easier to extend as it grows rather than harder.
That's not slow. That's the fast path.
Before you sign with anyone, there's a specific set of questions worth asking your SI or platform vendor. We covered them in detail here.
Most teams face a choice at the start of a data platform project. Build the foundation properly and accept that it takes time. Or skip it and move fast now, knowing you'll pay for it later.
Datacoves is built around the idea that you shouldn't have to make that trade-off.
It's an enterprise data engineering platform that runs inside your private cloud and comes with the foundation pre-built. Managed dbt and Airflow, a VS Code development environment your engineers can open on day one, CI/CD pipelines that enforce quality before anything reaches production, and an architecture built on best practices that your team inherits rather than invents.
The conventions, the guardrails, the deployment workflows, the secrets management, the testing framework. None of that gets figured out after the fact. It's already there.
That's what compresses the timeline. Not shortcuts. Not skipping steps. The foundation work is done, and your team starts from a position that most organizations spend a year trying to reach on their own.
The result is a team that ships consistently from early on, data that business users trust because quality is enforced by the system rather than by people remembering to check, and a platform that gets easier to extend as it grows.
Guitar Center onboarded in days. Johnson and Johnson described it as a framework accelerator. Those outcomes aren't the result of moving fast and fixing problems later. They're the result of starting with a foundation that didn't need to be fixed.
Snowflake is a great warehouse. The teams that get the most out of it aren't the ones who bought it and figured out the rest later. They're the ones who treated the platform layer as part of the project from the start. The tool doesn't build the platform. That part is still your decision to make.
Most enterprise data platform projects don't fail because of the tools. They fail because nobody in the room, not IT, not the consulting firm, not the vendor, had ever seen what a well-built platform actually looks like. So decisions get made by people delegating to other people who are also delegating. Enterprise standards get followed without being questioned. And two years later, you have a platform that works well enough to demo, moves too slowly to trust, and can only be changed by the team that built it. That's a predictable outcome of a broken process.
Most enterprise data platform projects don't fail because of the tools. They fail because nobody in the room had ever seen what a well-built platform looks like.
This article explains why it happens, what the warning signs look like from the outside, and what you should demand before you hand the keys to anyone again.
Two years ago, you had a problem. Your data was slow to get to, hard to trust, and impossible to act on. You brought it to IT. IT brought it to a consulting firm with the right logos and the right industry experience. Contracts were signed. A roadmap was presented. Everyone nodded.

What nobody said out loud is that each handoff in that chain came with a knowledge ceiling. You knew the business problem but not the technical solution. IT knew their enterprise standards but not necessarily what best-in-class data engineering looks like. The consulting firm knew their framework, the one they'd been building for years, the one that was genuinely good when they built it. Whether it was still the right answer for where the industry is today was a question nobody asked, because nobody in the room had the reference point to ask it.
This isn't about blame. The system worked exactly as designed. It just wasn't designed to produce what you actually needed.
You don't need to understand the architecture to recognize that something is wrong. You've seen it in other ways.
Reports that take three sprints to change. Business users who have stopped asking for new metrics because the process is too slow. Analysts who still use Excel because the data in the platform doesn't match what they expect. A go-live that keeps moving. Issues that surface in final testing, never earlier. A data team that spends more time firefighting than building.

These aren't symptoms of a tool problem. They're symptoms of a platform that was built to pass a demo, not to operate at scale. The foundation looked solid in the PowerPoint. Then one corner started sinking. Someone ran outside, propped it up, called it good enough, and everyone went back inside to talk about the drapes. Until the next corner started sinking.
The business lives with the consequences of decisions that were never explained to them. That's the part that should make you uncomfortable. Not because anyone was careless, but because the people making the technical decisions had never been asked to connect them back to your actual business problem.
Consulting firms sell services. The migration is the engagement. They have every incentive to deliver something that looks complete at handoff, and very little incentive to think about what happens eighteen months later when your team needs to extend it, upgrade it, or adapt it to something they didn't anticipate. If you're evaluating migration partners, here's what to look for.
That's just the business model. A platform built on open-source tools with documented conventions, enforced by the system rather than by people remembering to follow a process, doesn't generate the same ongoing billing as one that requires the original team to come back every time something needs to change. Someone who built one of these frameworks told me it was "good three years ago." That's a telling phrase. Good three years ago means it was already aging when it was delivered.
dbt, Airflow, SQLFluff, and the open-source ecosystem around them move fast because there are thousands of contributors and companies whose entire existence depends on making them better. A proprietary consulting framework moves at the pace of whoever owns it, if it moves at all.
The gap compounds. New Airflow capabilities require the consulting firm to schedule an upgrade. Framework changes require them to build it. Every month that passes, the distance between what's possible and what your team can actually access gets wider. Your engineers know it. They read the release notes. They see what other teams are doing. They just can't do it themselves because the platform wasn't built to be owned by them.
The teams that don't have this problem didn't get lucky. They built on a foundation that was designed to evolve, with tools maintained by communities that have no interest in keeping anyone dependent.
Every business leader has a GenAI mandate right now. Boards want it. CEOs are asking about it. And the instinct is to treat it as a separate initiative, something you layer on top of what you already have.
That's the mistake.
Think about your power grid. Nobody questions whether the lights will turn on when they flip the switch. That reliability exists because someone built generation, transmission, and distribution correctly before anyone thought about what to plug in. You're being asked to build a smart home on a grid that browns out every Tuesday.

AI is only as reliable as the data underneath it. If your data has no lineage, inconsistent definitions, and no documentation, your AI initiative will produce outputs nobody trusts, at scale, faster than you've ever produced untrustworthy outputs before.
According to Gartner, by 2026 organizations will abandon 60% of AI projects that are unsupported by AI-ready data.
The data quality problems that were always there don't disappear under a GenAI layer. They get amplified and made visible in the worst possible moment, in front of the stakeholders you most need to impress.
The foundation isn't a data engineering problem. It's the prerequisite to the most important initiative on your roadmap. Getting it right isn't IT's job to delegate. It's your decision to make.

A well-built platform doesn't rely on people following the process correctly every time. It makes the wrong action difficult.
The difference between a platform that scales and one that doesn't usually isn't visible in a demo. It shows up six months after go-live, when the team doubles, when a consultant rolls off, when someone new joins, and has to figure out how things work by reading documentation that may or may not exist.
A well-built platform doesn't rely on people following the process correctly every time. It makes the wrong action difficult. Automated checks run before anything can merge, so code quality doesn't depend on a reviewer having a good day. Naming conventions live in the tooling itself, not in a document that gets read once during onboarding. Pipeline dependencies are declared explicitly and validated automatically, so a missed configuration doesn't quietly cause a downstream failure three sprints later. A new engineer who joins the team inherits those guardrails on day one without reading a document or asking anyone how things work. The knowledge is in the system, not in the heads of the people who built it.
Good data engineering is a thought-out combination of tools that work well together, each maintained by a community that cares about making it better. The people improving dbt, Airflow, and SQLFluff aren't doing it on a consulting firm's project timeline. They're doing it because their companies depend on those tools getting better continuously.
dbt-coves automates the tedious parts of dbt development. dbt-checkpoint enforces data quality standards at commit time, before bad code ever reaches a pipeline. SQLFluff keeps SQL consistent across every engineer on the team. Snowcap manages Snowflake infrastructure as code, the thing that's perpetually "on the roadmap" in consulting-built platforms but never quite arrives.
None of these tools is the answer on its own. The answer is that someone chose them deliberately, integrated them carefully, and built a platform where they work together. That's the difference between a workshop that was designed and one that just accumulated tools over time.
It also means your platform can adapt. When a better approach emerges, you adopt it. Your engineers can use Claude Code, GitHub Copilot, Snowflake Cortex, or any other AI tool that fits their workflow. They're not waiting for a vendor to build an integration or a consulting firm to schedule a framework update. Some platforms give you one AI tool and call it done. That's a product decision masquerading as a strategy.
If you're about to start a data platform engagement with a consulting firm, a platform vendor, or anyone else, these are the questions worth asking before you sign.
Can you show me a client whose team ships on their own cadence, without your firm in the critical path for day-to-day decisions? Managing infrastructure is a legitimate service. What you want to avoid is a vendor who ends up managing your decisions, your roadmap, and your team's ability to move without them. That relationship tends to get more expensive over time, not less.
What open-source tools does your framework use, and how do you handle upgrades when new versions are released? A proprietary framework that wraps open-source tools is only as current as whoever maintains the wrapper. Find out who that is and how often it happens.
What happens if a business user needs a new metric? Walk me through the process end-to-end, including who approves it, how long it takes, and what the business user can do themselves. The answer will tell you whether they're building you an industrial kitchen or planning to cook everything for you. Business users are smart enough to use a knife. A platform built on that assumption looks very different from one built on the assumption that they aren't.
Who on our team will own the business logic, the models, and the deployment process? Not the infrastructure, the work. A good partner manages the complexity underneath so your engineers can focus on delivering. If the answer to ownership is vague, the engagement was designed around their continuity, not yours.
How do you handle the Snowflake security model, and can you show me examples? This one is worth asking even if you don't fully understand the answer. Pay attention to whether they answer with confidence and specificity, or whether they say "this came directly from Snowflake" as if that settles it. It doesn't. A recommendation from a Snowflake account team is a starting point, not an architecture.
Most leaders assume that doing things right means slowing down. That getting the foundation in place before starting to deliver means months of invisible work before anyone sees results. That's not how this works.

The Datacoves Foundation engagement takes less than two months.
In that time, you get:
The goal is to move slowly to move fast. Not slow forever. Slow for eight weeks, so your team isn't spending the next three years propping up cut corners. Skip the foundation, and you move fast for the first six months, then spend the next three years explaining why everything takes so long. Get it right and your team ships twice a week without being afraid of what they might break.
Guitar Center onboarded in days, not months. Johnson & Johnson described it as a framework accelerator. A team at DataDrive saved over 200 hours annually by replacing a fragile self-built pipeline with something that actually held up at scale. None of them spent 18 months waiting to find out if it would work.
Without a solid foundation, your GenAI initiative will surface data problems at the worst possible moment. Your business users will keep working around the platform instead of in it. And your engineers will keep moving carefully instead of moving fast.
You shouldn't have to build this from scratch. Most teams already paid someone to do it. They just didn't get it.

dbt (data build tool) is a SQL-based transformation framework that turns raw data into trusted, analytics-ready datasets directly inside your data warehouse. It brings software engineering discipline to analytics: version control, automated testing, CI/CD, and auto-generated documentation. dbt handles the "T" in ELT. It does not extract, load, or move data.
dbt focuses exclusively on the transformation layer of ELT (Extract, Load, Transform). Unlike traditional ETL tools that handle the entire pipeline, dbt assumes data already exists in your warehouse. Ingestion tools like Informatica, Azure Data Factory, or Fivetran load the raw data. dbt transforms it into trusted, analytics-ready datasets.
A dbt project consists of SQL files called models. Each model is a SELECT statement that defines a transformation. When you run dbt, it compiles these models, resolves dependencies, and executes the SQL directly in your warehouse. The results materialize as tables or views. Data never leaves your warehouse.
Example: A Simple dbt Model (models/marts/orders_summary.sql)

SELECT
    customer_id,
    COUNT(*) AS total_orders,
    SUM(order_amount) AS lifetime_value,
    MIN(order_date) AS first_order_date
FROM {{ ref('stg_orders') }}
GROUP BY customer_id

The {{ ref('stg_orders') }} syntax creates an explicit dependency. dbt uses these references to build a dependency graph (DAG) of your entire pipeline, ensuring models run in the correct order.
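The ordering dbt derives from ref() calls can be illustrated with a small topological sort. This is a simplified sketch of the idea, not dbt's actual implementation, and the model names besides stg_orders and orders_summary are hypothetical:

```python
from graphlib import TopologicalSorter

# Each model lists the models it ref()s, mirroring {{ ref('...') }} calls.
deps = {
    "stg_orders": [],                                   # staging model, reads a source
    "stg_customers": [],
    "orders_summary": ["stg_orders"],                   # FROM {{ ref('stg_orders') }}
    "customer_360": ["stg_customers", "orders_summary"],
}

# static_order() yields an order where every dependency runs before
# the models that reference it.
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)
```

Staging models come out first, and customer_360 last, because everything it depends on must already be built.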

For large datasets, dbt supports incremental models that process only new or changed data. This keeps pipelines fast and warehouse costs controlled as data volumes grow.
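The incremental pattern boils down to a high-water mark: each run picks up only rows newer than the latest timestamp already materialized. A minimal pure-Python sketch of that filter (the rows and column names are made up for illustration):

```python
# Simulated source rows; in dbt this would be the upstream table.
source_rows = [
    {"order_id": 1, "updated_at": "2025-01-01"},
    {"order_id": 2, "updated_at": "2025-01-05"},
    {"order_id": 3, "updated_at": "2025-01-09"},
]

# Rows already materialized in the target table by a previous run.
target_rows = [{"order_id": 1, "updated_at": "2025-01-01"}]

# High-water mark: the latest timestamp already in the target.
# dbt expresses the same filter in SQL, roughly:
#   WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
watermark = max(r["updated_at"] for r in target_rows)
new_rows = [r for r in source_rows if r["updated_at"] > watermark]

target_rows.extend(new_rows)
print(len(new_rows))  # 2 -- only the new rows were processed
```

The full table is never rebuilt; only the two rows past the watermark are scanned and appended, which is what keeps warehouse costs flat as history grows.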
With dbt, teams can:
dbt handles the "T" in ELT. It does not extract, load, or move data between systems.

Misaligned expectations are a primary cause of failed dbt implementations. Knowing what dbt does not do matters as much as knowing what it does.
This separation of concerns is intentional. By focusing exclusively on transformation, dbt allows enterprises to evolve their ingestion, orchestration, and visualization layers independently. You can swap Informatica for Azure Data Factory or migrate from Redshift to Snowflake without rewriting your business logic.
dbt is a tool, not a strategy. Organizations with unclear data ownership, no governance framework, or misaligned incentives will not solve those problems by adopting dbt. They will simply have the same problems with versioned SQL.
For a deeper comparison, see dbt vs Airflow: Which data tool is best for your organization?
Over 30,000 companies use dbt weekly, including JetBlue, HubSpot, Roche, J&J, Block, and Nasdaq (dbt Labs, 2024 State of Analytics Engineering).
Enterprise adoption of dbt has accelerated because it solves problems that emerge specifically at scale. Small teams can manage transformation logic in spreadsheets and ad hoc scripts. At enterprise scale, that approach creates compounding risk.
dbt has moved well beyond startups into regulated, enterprise environments:
Life Sciences: Roche, Johnson & Johnson (See how J&J modernized their data stack with dbt), and pharmaceutical companies with strict compliance requirements
These are not proof-of-concept deployments. These are production systems powering executive dashboards, regulatory reporting, and customer-facing analytics.
Without a standardized transformation layer, enterprise analytics fails in predictable ways:
Organizations report 45% of analyst time is spent finding, understanding, and fixing data quality issues (Gartner Data Quality Market Survey, 2023).
dbt addresses these problems by treating transformation logic as production code:

One of the most underappreciated reasons enterprises adopt dbt is leverage. dbt is not just a transformation framework. It sits at the center of a broad ecosystem that reduces implementation risk and accelerates delivery.
dbt packages are reusable projects available at hub.getdbt.com. They provide pre-built tests, macros, and modeling patterns that let teams leverage proven approaches instead of building from scratch.
Popular packages include:
Using packages signals operational maturity. It reflects a preference for shared, tested patterns over bespoke solutions that create maintenance burden. Mature organizations also create internal packages they can share across teams to leverage learnings across the company.
dbt integrates with the broader data stack through its rich metadata (lineage, tests, documentation):
Because dbt produces machine-readable metadata, it acts as a foundation that other tools build on. This makes dbt a natural anchor point for enterprise data platforms.
The dbt Slack community has 100,000+ members sharing patterns, answering questions, and debugging issues (dbt Labs Community Stats, 2024).
For enterprises, community size matters because:
When you adopt dbt, you are not just adopting a tool. You are joining an ecosystem with momentum.
A typical dbt workflow follows software engineering practices familiar to any developer:
models:
  - name: orders_summary
    description: "Customer-level order aggregations"
    columns:
      - name: customer_id
        description: "Primary key from source system"
        tests:
          - unique
          - not_null
      - name: lifetime_value
        description: "Sum of all order amounts in USD"

For executives and data leaders, dbt is less about SQL syntax and more about risk reduction and operational efficiency.
Organizations implementing dbt with proper DataOps practices report:
dbt supports enterprise governance requirements by making transformations explicit and auditable:
The question for enterprise leaders is not "Should we use dbt?" The question is "How do we operate dbt as production infrastructure?"
dbt Core is open source, and many teams start by running it on a laptop. But open source looks free the way a free puppy looks free. The cost is not in the acquisition. The cost is in the care and feeding.
For a detailed comparison, see Build vs Buy Analytics Platform: Hosting Open-Source Tools.
The hard part is not installing dbt. The complexity comes from everything around it:
Building your own dbt platform is like wiring your own home: possible, but very few teams should. Most enterprises find that building and maintaining this infrastructure becomes a distraction from their core mission of delivering data products.
dbt delivers value when supported by clear architecture, testing standards, CI/CD automation, and a platform that enables teams to work safely at scale.
Skip the Infrastructure. Start Delivering.
Datacoves provides managed dbt and Airflow deployed in your private cloud, with pre-built CI/CD, VS Code environments, and best-practice architecture out of the box. Your data never leaves your network. No VPC peering required.
Learn more about Managed dbt + Airflow

Before adopting or expanding dbt, leaders should ask:
Is your transformation logic auditable? If business rules live in dashboards, stored procedures, or tribal knowledge, the answer is no. dbt makes every transformation visible, version-controlled, and traceable.
Do your teams define metrics the same way? If "revenue" or "active user" means different things to different teams, you have metric drift. dbt centralizes definitions in code so everyone works from a single source of truth.
Where do you find data quality issues? If problems surface in executive dashboards instead of daily data quality checks, you lack automated testing. dbt runs tests on every build, catching issues before they reach end users.
How long does onboarding take? If new analysts spend weeks decoding tribal knowledge, your codebase is not self-documenting. dbt generates documentation and lineage automatically from code.
Who owns your infrastructure? Decide whether your engineers should be building platforms or building models. Operating dbt at scale requires CI/CD, orchestration, environments, and security. That work must live somewhere.
Can you trace how a number was calculated? If auditors or regulators ask how a reported figure was derived, you need full lineage from source to dashboard. dbt provides that traceability by design.
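Tracing how a number was calculated is a reachability question over the same ref() dependency graph dbt already maintains. A small illustrative sketch (the model names are hypothetical):

```python
# Map each model to its upstream dependencies (what it ref()s or sources).
upstream = {
    "revenue_dashboard": ["orders_summary"],
    "orders_summary": ["stg_orders"],
    "stg_orders": ["raw.orders"],  # raw source table
    "raw.orders": [],
}

def lineage(model: str) -> set[str]:
    """Return every upstream model and source a figure ultimately comes from."""
    seen: set[str] = set()
    stack = [model]
    while stack:
        node = stack.pop()
        for dep in upstream.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(sorted(lineage("revenue_dashboard")))
# ['orders_summary', 'raw.orders', 'stg_orders']
```

When an auditor asks where a dashboard figure came from, the answer is a traversal of declared dependencies, not an archaeology project through stored procedures.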
dbt has become the standard for enterprise data transformation because it makes business logic visible, testable, and auditable. But the tool alone is not the strategy. Organizations that treat dbt as production infrastructure, with proper orchestration, CI/CD, and governance, unlock its full value. Those who skip the foundation often find themselves rebuilding later.
Ready to skip the infrastructure complexity? See how Datacoves helps enterprises operate dbt at scale

A lean analytics stack built with dlt, DuckDB, DuckLake, and dbt delivers fast insights without the cost or complexity of a traditional cloud data warehouse. For teams prioritizing speed, simplicity, and control, this architecture provides a practical path from raw data to production-ready analytics.
In practice, teams run this stack using Datacoves to standardize environments, manage workflows, and apply production guardrails without adding operational overhead.
A lean analytics stack works when each tool has a clear responsibility. In this architecture, ingestion, storage, and transformation are intentionally separated so the system stays fast, simple, and flexible.
Together, these tools form a modern lakehouse-style stack without the operational cost of a traditional cloud data warehouse.
Running DuckDB locally is easy. Running it consistently across machines, environments, and teams is not. This is where MotherDuck matters.
MotherDuck provides a managed control plane for DuckDB and DuckLake, handling authentication, metadata coordination, and cloud-backed storage without changing how DuckDB works. You still query DuckDB. You just stop worrying about where it runs.
To get started:
1. Create a MotherDuck account.
2. Generate an access token in the MotherDuck UI.
3. Export it as an environment variable (MOTHERDUCK_TOKEN).

This single token is used by dlt, DuckDB, and dbt to authenticate securely with MotherDuck. No additional credentials or service accounts are required.
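Because every tool in the stack reads the same MOTHERDUCK_TOKEN environment variable, a quick pre-flight check avoids confusing mid-pipeline authentication failures. A hypothetical helper along these lines (the token value shown is a placeholder):

```python
import os

def require_motherduck_token() -> str:
    """Fail fast with a clear message if the MotherDuck token is missing."""
    token = os.environ.get("MOTHERDUCK_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "MOTHERDUCK_TOKEN is not set; export it before running "
            "dlt pipelines or dbt builds against MotherDuck."
        )
    return token

# Example: simulate a configured environment (hypothetical token value).
os.environ["MOTHERDUCK_TOKEN"] = "md_example_token"
print(require_motherduck_token())
```

Calling this at the top of an ingestion script turns a cryptic connection error into an actionable one-line message.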
At this point, you have:
1. A cloud-backed DuckDB environment that behaves the same on every machine.
2. A single token (MOTHERDUCK_TOKEN) that authenticates dlt, DuckDB, and dbt.
That consistency is what makes the rest of the stack reliable.
In a lean data stack, ingestion should be reliable, repeatable, and boring. That is exactly what dlt is designed to do.
dlt loads raw data into DuckDB with strong defaults for schema handling, incremental loads, and metadata tracking. It removes the need for custom ingestion frameworks while remaining flexible enough for real-world data sources.
In this example, dlt ingests a CSV file and loads it into a DuckDB database hosted in MotherDuck. The same pattern works for APIs, databases, and file-based sources.
To keep dependencies lightweight and avoid manual environment setup, we use uv to run the ingestion script with inline dependencies.
pip install uv
touch us_populations.py
chmod +x us_populations.py

The script below uses dlt's MotherDuck destination. Authentication is handled through the MOTHERDUCK_TOKEN environment variable, and data is written to a raw schema in DuckDB.
#!/usr/bin/env -S uv run
# /// script
# dependencies = [
#     "dlt[motherduck]==1.16.0",
#     "psutil",
#     "pandas",
#     "duckdb==1.3.0"
# ]
# ///
"""Loads a CSV file to MotherDuck"""
import dlt
import pandas as pd

from utils.datacoves_utils import pipelines_dir


@dlt.resource(write_disposition="replace")
def us_population():
    url = "https://raw.githubusercontent.com/dataprofessor/dashboard-v3/master/data/us-population-2010-2019.csv"
    df = pd.read_csv(url)
    yield df


@dlt.source
def us_population_source():
    return [us_population()]


if __name__ == "__main__":
    # Configure MotherDuck destination with explicit credentials
    motherduck_destination = dlt.destinations.motherduck(
        destination_name="motherduck",
        credentials={
            "database": "raw",
            "motherduck_token": dlt.secrets.get("MOTHERDUCK_TOKEN")
        }
    )
    pipeline = dlt.pipeline(
        progress="log",
        pipeline_name="us_population_data",
        destination=motherduck_destination,
        pipelines_dir=pipelines_dir,
        # dataset_name is the target schema name in the "raw" database
        dataset_name="us_population"
    )
    load_info = pipeline.run([
        us_population_source()
    ])
    print(load_info)

Running the script loads the data into DuckDB:

./us_populations.py

At this point, raw data is available in DuckDB and ready for transformation. Ingestion is fully automated, reproducible, and versionable, without introducing a separate ingestion platform.
Once raw data is loaded into DuckDB, transformations should follow the same disciplined workflow teams already use elsewhere. This is where dbt fits naturally.
dbt provides version-controlled models, testing, documentation, and repeatable builds. The difference in this stack is not how dbt works, but where tables are materialized.
By enabling DuckLake, dbt materializes tables as Parquet files with centralized metadata instead of opaque DuckDB-only files. This turns DuckDB into a true lakehouse engine while keeping the developer experience unchanged.
To get started, install dbt and the DuckDB adapter:
pip install dbt-core==1.10.17
pip install dbt-duckdb==1.10.0
dbt init

Next, configure your dbt profile to target DuckLake through MotherDuck:
default:
  outputs:
    dev:
      type: duckdb
      # This requires the environment var MOTHERDUCK_TOKEN to be set
      path: 'md:datacoves_ducklake'
      threads: 4
      schema: dev  # this will be the prefix used in the duckdb schema
      is_ducklake: true
  target: dev

This configuration does a few important things:
1. It connects dbt to the datacoves_ducklake database in MotherDuck (path: 'md:datacoves_ducklake').
2. It enables DuckLake materialization (is_ducklake: true), so tables are written as Parquet with centralized metadata.
3. It authenticates through the MOTHERDUCK_TOKEN environment variable.

With this in place, dbt models behave exactly as expected. Models materialized as tables are stored in DuckLake, while views and ephemeral models remain lightweight and fast.
From here, teams can:
This is the key advantage of the stack: modern analytics engineering practices, without the overhead of a traditional warehouse.
This lean stack is not trying to replace every enterprise data warehouse. It is designed for teams that value speed, simplicity, and cost control over heavyweight infrastructure.
This approach works especially well when:
The trade-offs are real and intentional. DuckDB and DuckLake excel at analytical workloads and developer productivity, but they are not designed for high-concurrency BI at massive scale. Teams with hundreds of dashboards and thousands of daily users may still need a traditional warehouse.
Where this stack shines is time to value. You can move from raw data to trusted analytics quickly, with minimal infrastructure, and without locking yourself into a platform that is expensive to unwind later.
In practice, many teams use this architecture as:
When paired with Datacoves, teams get the operational guardrails this stack needs to run reliably. Datacoves standardizes environments, integrates orchestration and CI/CD, and applies best practices so the simplicity of the stack does not turn into fragility over time.
Teams often run this stack with Datacoves to standardize environments, apply production guardrails, and avoid the operational drag of DIY platform management.
If you want to see this stack running end to end, watch the Datacoves + MotherDuck webinar. It walks through ingestion with dlt, transformations with dbt and DuckLake, and how teams operationalize the workflow with orchestration and governance.
The session also covers:
The merger of dbt Labs and Fivetran (which we refer to as dbt Fivetran for simplicity) represents a new era in enterprise analytics. The combined company is expected to create a streamlined, end-to-end data workflow consolidating data ingestion, transformation, and activation with the stated goal of reducing operational overhead and accelerating delivery. Yet, at the dbt Coalesce conference in October 2025 and in ongoing conversations with data leaders, many are voicing concerns about price uncertainty, reduced flexibility, and the long-term future of dbt Core.
As enterprises evaluate the implications of this merger, understanding both the opportunities and risks is critical for making informed decisions about their organization's long-term analytics strategy.
In this article, you’ll learn:
1. What benefits could the dbt Fivetran merger offer enterprise data teams
2. Key risks and lessons from past open-source acquisitions
3. How enterprises can manage risks and challenges
4. Practical steps dbt Fivetran can take to address community anxiety

For enterprise data teams, the dbt Fivetran merger may bring compelling opportunities:
1. Integrated Analytics Stack:
The combination of ingestion, transformation, and activation (reverse ETL) processes may enhance onboarding by streamlining contract management, security evaluations, and user training.
2. Resource Investment:
The merged company has the potential to speed up feature development across the data landscape. Open data standards like Iceberg could see increased adoption, fostering interoperability between platforms such as Snowflake and Databricks.
While these prospects are enticing, they are not guaranteed. The newly formed organization now faces the non-trivial task of merging various teams, including Fivetran, HVR (Oct 2021), Census (May 2025), SQLMesh/Tobiko (Sept 2025), and dbt Labs (Oct 2025). Successfully integrating their tools, development practices, and support functions will be crucial. To create a truly seamless, end-to-end platform, alignment of product roadmaps, engineering standards, and operational processes will be necessary. Enterprises should carefully assess the execution risks when considering the promised benefits of this merger, as these advantages hinge on Fivetran's ability to effectively integrate these technologies and teams.

The future openness and flexibility of dbt Core is being questioned, with significant consequences for enterprise data teams that rely on open-source tooling for agility, security, and control.
dbt’s rapid adoption, now exceeding 80,000 projects, was fueled by its permissive Apache License and a vibrant, collaborative community. This openness allowed organizations to deploy, customize, and extend dbt to fit their needs, and enabled companies like Datacoves to build complementary tools, sponsor open-source projects, and simplify enterprise data workflows.
However, recent moves by dbt Labs, accelerated by the Fivetran merger, signal a natural evolution toward monetization and enterprise alignment:
1. Licensing agreement with Snowflake
2. Rewriting dbt Core as dbt Fusion under a more restrictive ELv2 license
3. Introducing a “freemium” model for the dbt VS Code Extension, limiting free use to 15 registered users per organization

While these steps are understandable from a business perspective, they introduce uncertainty and anxiety within the data community. The risk is that the balance between open innovation and commercial control could tip, raising understandable questions about long-term flexibility that enterprises have come to expect from dbt Core.
dbt Labs and Fivetran have both stated that dbt Core's license would not change, and I believe them. The vast majority of dbt users are using dbt Core and changing the licenses risks fragmentation and loss of goodwill in the community. The future vision for dbt is not dbt Core, but instead dbt Fusion.
While I see a future for dbt Core, I don't feel the same about SQLMesh. There is little chance that the dbt Fivetran organization would continue to invest in two open-source projects. It is also unlikely that SQLMesh innovations would make their way into dbt Core, as that would directly compete with dbt Fusion.
Recent history offers important cautionary tales for enterprises. While none is a direct parallel, each is worth learning from:
1. Terraform: A license change led to fragmentation and the creation of OpenTofu, eroding trust in the original steward.
2. Elasticsearch: License restrictions resulted in the OpenSearch fork, dividing the community and increasing support risks.
3. Redis and MongoDB: Similar license shifts caused forks or migrations to alternative solutions, increasing risk and migration costs.
For enterprise data leaders, these precedents highlight the dangers of vendor fragmentation, increased migration costs, and uncertainty around long-term support. When foundational tools become less open, organizations may face difficult decisions about adapting, migrating, or seeking alternatives. If you're considering your options, check out our Platform Evaluation Worksheet.
On the other hand, there are successful models where open-source projects and commercial offerings coexist and thrive:
1. Airflow: Maintains a permissive license, with commercial providers offering managed services and enterprise features.
2. GitLab, Spark, and Kafka: Each has built a sustainable business around a robust open-source core, monetizing through value-added services and features.
These examples show that a healthy open-source core, supported by managed services and enterprise features, can benefit all stakeholders, provided the commitment to openness remains.
To navigate the evolving landscape, enterprises should:
1. Monitor licensing and governance changes closely.
2. Engage in community and governance discussions to advocate for transparency.
3. Plan for contingencies, including potential migration or multi-vendor strategies.
4. Diversify by avoiding over-reliance on a single vendor or platform.
Avoid Vendor Lock-In:
1. Continue to leverage multiple tools for data ingestion and orchestration (e.g., Airflow) instead of relying solely on a single vendor’s stack.
2. Why? This preserves your ability to adapt as technology and vendor priorities evolve. While tighter tool integration is a potential promise of consolidation, options exist to reduce the burden of a multi-tool architecture.
For instance, Datacoves is built to help enterprises maintain governance, reliability, and freedom of choice to deploy securely in their own network, specifically supporting multi-tool architectures and open standards to minimize vendor lock-in risk.
Demand Roadmap Transparency:
1. Engage with your vendors about their product direction and advocate for community-driven development.
2. Why? Transparency helps align vendor decisions with your business needs and reduces the risk of disruptive surprises.
Participate in Open-Source Communities:
1. Contribute to and help maintain the open-source projects that underpin your data platform.
2. Why? Active participation ensures your requirements are heard and helps sustain the projects you depend on.
Attend and Sponsor Diverse Conferences:
1. Support and participate in community-driven events (such as Airflow Summit) to foster innovation and avoid concentration of influence.
2. Why? Exposure to a variety of perspectives leads to stronger solutions and a healthier ecosystem.
Support OSS Creators Financially and Through Advocacy:
1. Sponsor projects or directly support maintainers of critical open-source tools.
2. Why? Sustainable funding and engagement are vital for the health and reliability of the open-source ecosystem.
Encourage Openness and Diversity:
1. Champion Diversity in OSS Governance: Advocate for broad, meritocratic project leadership and a diverse contributor base.
2. Why? Diverse stewardship drives innovation, resilience, and reduces the risk of any one entity dominating the project’s direction.
Long-term analytics success isn’t just about technology selection. It’s about actively shaping the ecosystem through strategic diversification, transparent vendor engagement, and meaningful support of open standards and communities. Enterprises that invest in these areas will be best equipped to thrive, no matter how the vendor landscape evolves.
While both dbt Labs and Fivetran have stated that the dbt Core license would remain permissive, to preserve trust and innovation in the data community, dbt Fivetran should commit to neutral governance and open standards for dbt Core, ensuring it remains a true foundation for collaboration, not fragmentation.
It is common knowledge that the dbt community has powered a remarkable flywheel of innovation, career growth, and ecosystem expansion. Disrupting this momentum risks technical fragmentation and loss of goodwill, outcomes that benefit no one in the analytics landscape.
To maintain community trust and momentum, dbt Fivetran should:
1. Establish Neutral Governance:
Place dbt Core under independent oversight, where its roadmap is shaped by a diverse set of contributors, not just a single commercial entity. Projects like Iceberg have shown that broad-based governance sustains engagement and innovation, compared to more vendor-driven models like Delta Lake.
2. Consider Neutral Stewardship Models:
One possible long-term approach that has been seen in projects like Iceberg and OpenTelemetry is to place an open-source core under neutral foundation governance (for example, the Linux Foundation or Apache Software Foundation).
While dbt Labs and Fivetran have both reaffirmed their commitment to keeping dbt Core open, exploring such models in the future could further strengthen community trust and ensure continued neutrality as the platform evolves.
3. Encourage Meritocratic Development: Empower a core team representing the broader community to guide dbt Core’s future. This approach minimizes the risk of forks and fragmentation and ensures that innovation is driven by real-world needs.
4. Apply Lessons from MetricFlow: When dbt Labs acquired MetricFlow and changed its license to BSL, it led to further fragmentation in the semantic layer space. Now, with MetricFlow relicensed as Apache and governed by the Open Semantic Interchange (OSI) initiative (including dbt Labs, Snowflake, and Tableau), the project is positioned as a vendor-neutral standard. This kind of model should be considered for dbt Core as well.
1. Technical teams: By ensuring continued access to an open, extensible framework, and reducing the risk of disruptive migration.
2. Business leaders: By protecting investments in analytics workflows and minimizing vendor lock-in or unexpected costs.
Solidifying dbt Core as a true open standard benefits the entire ecosystem, including dbt Fivetran, which is building its future, dbt Fusion, on this foundation. Taking these steps would not only calm community anxiety but also position dbt Fivetran as a trusted leader for the next era of enterprise analytics.
The dbt Fivetran merger represents a defining moment for the modern data stack, promising streamlined workflows while simultaneously raising critical questions about vendor lock-in, open-source governance, and long-term flexibility. Successfully navigating this shift requires a proactive, diversified strategy, one that champions open standards and avoids over-reliance on any single vendor. Enterprises that invest in active community engagement and robust contingency planning will be best equipped to maintain control and unlock maximum value from their analytics platforms.
If your organization is looking for a way to mitigate these risks and secure your workflows with enterprise-grade governance and multi-tool architecture, Datacoves offers a managed platform designed for maximum flexibility and control. For a deeper look, find out what Datacoves has to offer.
Ready to take control of your data future? Contact us today to explore how Datacoves allows organizations to take control while still simplifying platform management and tool integration.
