Tooling

Deep dives into modern data tools, platforms, and integrations to power your data stack.
Snowflake Won't Build Your Data Platform For You
5 mins read

Snowflake is one of the best data warehouses available. But buying it doesn't give you a data platform. A working platform also requires an engineering environment where your team can develop consistently, orchestration to run and monitor pipelines, CI/CD to enforce quality before anything reaches production, and ways of working that make the whole thing maintainable as your team grows. Most Snowflake implementations deliver the warehouse. The platform layer around it, and the practices underneath it, are usually left for your team to figure out after the SI rolls off. That gap is where most implementations quietly fail. 

Buying Snowflake gives you a warehouse. A working data platform requires an engineering environment, orchestration, CI/CD, and ways of working that don't come with the warehouse contract. 

What a Data Platform Actually Requires 

A data warehouse stores and processes data. That's what it was designed to do, and Snowflake does it exceptionally well. 

A data platform does something different. It's the environment where your team develops, tests, deploys, and monitors data products. It includes the tools, the conventions, and the ways of working that determine whether your data is trustworthy, usable, and maintainable at scale. 

The distinction matters because most implementations are scoped around the warehouse. The platform layer gets treated as something that will sort itself out later. It rarely does. 

Think about it in two layers. 

The first is what your users experience: whether they trust the data, whether they can find and understand it, and whether business and technical teams can communicate around it. In short: trustworthiness, usability, and collaboration. 

The second is what makes those outcomes possible at the platform level: whether data products can be reused without rebuilding from scratch, whether the system is maintainable when people leave or the team grows, and whether pipelines are reliable enough that failures get caught early instead of surfacing in a meeting. In short: reusability, maintainability, and reliability. 

Most Snowflake implementations deliver storage and compute. The six outcomes above are what your business expected the platform to produce. They require deliberate work that sits outside the warehouse contract.

Two Layer Framework

Why Snowflake Alone Doesn't Get You There 

Snowflake is excellent at what it does. Fast queries, elastic scaling, clean separation of storage and compute, a strong security model. If your previous warehouse was on-prem or running on aging infrastructure, the difference is real and immediate. 

The problem isn't Snowflake. The expectation that the warehouse is the platform is. 

Snowflake handles storage, compute, and access control. It doesn't give your team a development environment. It doesn't orchestrate your pipelines or tell you when one failed and why. It doesn't enforce naming conventions, testing standards, or deployment rules. It doesn't document your data models or make them understandable to a business analyst who didn't build them. It doesn't define how your team reviews code, manages branches, or promotes changes from development to production. 

Those things aren't gaps in Snowflake's product. They were never Snowflake's job. 

But when leaders evaluate a warehouse and sign a contract, the scope of what they're buying rarely gets articulated clearly. The demos show fast queries and a clean UI. The pitch covers performance benchmarks and cost savings versus the legacy system. Nobody walks through the engineering environment your team will need to build on top of it, because that's not what the vendor is selling. 

So teams buy a best-in-class warehouse and then spend the next six months discovering everything else they need. Some figure it out. Some don't. And most take a long time to get there. 

How Leaders End Up With a Warehouse and Not Much Else 

There are three common paths to a Snowflake implementation. Each one has real strengths. Each one has a predictable blind spot that leads to the same outcome: a warehouse that works, but fails to deliver the expected results. 

Three Paths

The Vendor Marketing Problem 

Snowflake's marketing is good. That's not a criticism; it's an observation. The positioning is clear, the case studies are compelling, and the product genuinely delivers on the core promise. 

What the marketing doesn't cover is everything that sits around the warehouse. That's not Snowflake's job. Their job is to sell Snowflake. The implicit message, though, is that the hard problem is the warehouse. Once that's solved, everything else follows. 

It doesn't. Leaders who build their implementation strategy around the vendor pitch tend to underscope the project from the start. The warehouse gets stood up on time and on budget. The data engineering environment, the orchestration layer, and the governance foundation get deferred. Sometimes indefinitely. 

The Internal Enthusiasm Problem 

Every organization has at least one person who comes back from a Snowflake conference ready to modernize everything. That enthusiasm is valuable. It's also frequently mis-channeled. 

Internal champions know the business problem well. They've seen the pain. What they often don't have is deep experience building and operating a production data platform from scratch. They know what good outcomes look like. They haven't necessarily seen what a well-built foundation looks like underneath those outcomes. 

So the implementation gets shaped around what they know: the warehouse, the transformation tool, maybe a basic orchestration setup. The harder questions around developer environments, CI/CD, testing standards, secrets management, and deployment conventions don't get asked because nobody in the room has been burned by skipping them before. 

The SI Migration Problem 

A migration is not a platform implementation. The SI's job is to get your data into Snowflake. Whether the environment your team inherits is maintainable and built on sound engineering practices is usually outside the engagement scope. 

System integrators are good at migrations. Moving data from point A to point B, replicating existing logic in a new tool, hitting a go-live date. That's what most of them are scoped and incentivized to deliver. 

It's not that SIs cut corners. It's that "build a production-grade data engineering platform with sustainable ways of working" wasn't in the statement of work. 

What gets handed off is a warehouse with some tables, some transformation logic, and documentation that will be out of date within a month. The team that inherits it then spends the next year figuring out how to operate it at scale. 

If you're evaluating implementation partners, here's what to look for before you sign. 

What Gets Skipped When You Rush the Foundation 

When the implementation is scoped around the warehouse and the migration, a predictable set of things gets deferred. Not because anyone decided they didn't matter, but because they weren't on the project plan. 

Here's what that looks like in practice six to twelve months later. 

Snowflake costs start climbing. Without well-structured data models, query optimization standards, and sensible clustering strategies, warehouses burn credits fast. Teams that skipped the engineering foundation often spend the first year optimizing for cost rather than delivering new capabilities. The savings from migrating off the legacy system quietly get absorbed by an inefficient Snowflake setup. 

Business users don't trust the data. When there are no testing standards, no documentation conventions, and no consistent naming across models, analysts spend more time validating numbers than using them. The platform gets a reputation for being unreliable. People go back to Excel because nobody built the layer that makes data understandable and trustworthy. 

The team can't move fast. Without CI/CD pipelines, code reviews, and deployment guardrails, every change is a risk. Engineers slow down because they're afraid of breaking something. Onboarding a new team member takes weeks because the knowledge lives in people's heads, not in the system. 

Pipelines break in ways nobody sees coming. Without orchestration that handles dependencies, retries, and failure alerts, pipeline failures surface downstream. A business user notices the numbers are wrong before the data team does. That erodes trust fast and is hard to rebuild. 

The foundation debt compounds. Every week that passes without fixing the underlying structure makes it harder to fix. New models get built on top of a shaky base. Refactoring becomes expensive. The team that was supposed to be delivering new data products spends its time maintaining what already exists. 

This is the real cost of the quick win approach. Six months of fast progress followed by years of slow, careful, expensive work to undo the shortcuts. 

We've documented what that looks like in practice here. 

Tools and Ways of Working Have to Go Together 

Most implementation conversations focus on the tool stack. Which warehouse, which transformation framework, which orchestrator. Those are real decisions and they matter. 

But the teams that deliver reliable data products consistently aren't just using the right tools. They're using them the same way across every engineer on the team. 

That's the ways of working problem. And it's the part nobody puts in the project plan. 

A team with Snowflake and dbt but no agreed branching strategy, no code review process, no testing standards, and no deployment conventions is still fragile. One engineer builds models one way. Another builds them differently. A third inherits both and must figure out which approach is "correct" before they can extend anything. The system never enforced a consistent approach. 

The same applies to orchestration. Airflow is powerful. An Airflow environment where every engineer writes DAGs differently, secrets are managed inconsistently, and there's no standard for how pipeline failures get handled is not an asset. It's a maintenance problem waiting to get worse. 

Good data engineering is a thought-out combination of tools and conventions that work together. The conventions are what make the tools scale beyond the person who set them up. 

This is why the two-layer framework matters in practice. Trustworthiness, usability, and collaboration aren't outcomes you get from buying the right tools. They're outcomes you get when the platform layer underneath, the reusability, maintainability, and reliability, is built deliberately. With both the right tooling and the right ways of working enforced by the system itself, not by people remembering to follow a document. 

The teams that figure this out usually do it the hard way. They run into the problems first, then back into the conventions that would have prevented them. That process can take years and a lot of frustration. Getting the ways of working right from the start compresses that timeline significantly. 

Doing It Right Upfront Is the Fast Path 

Fast Path Debt Path
The teams that move fastest twelve months in are almost always the ones who slowed down at the start. 

The most common objection to investing in the foundation is time. Leaders have stakeholders who want results. Boards want dashboards. The business wants answers. Spending eight weeks building an engineering environment and establishing conventions feels like the opposite of moving fast. 

That instinct is understandable. It's also wrong. 

The teams that move fastest twelve months in are almost always the ones who slowed down at the start. Not forever. For a few weeks. Long enough to get the development environment right, establish the conventions, wire up CI/CD, and make sure the orchestration layer is solid before anyone builds on top of it. 

The teams that skipped that work aren't moving fast. They're managing debt. Every new model gets built carefully because nobody is sure what it might break. Every pipeline change requires manual testing because the automated checks were never put in place. Every new hire takes weeks to get productive because the knowledge lives in people, not in the system. 

A quick start that skips the foundation isn't free. It's a loan at a high interest rate. The payments start small and get larger every month. 


Getting the foundation right upfront doesn't mean months of invisible infrastructure work before anyone sees results. Done well, it takes weeks, not quarters. And what you get on the other side is a team that ships twice a week without being afraid of what they might break, data that business users trust, and a platform that gets easier to extend as it grows rather than harder. 

That's not slow. That's the fast path. 

Before you sign with anyone, there's a specific set of questions worth asking your SI or platform vendor. We covered them in detail here. 

How Datacoves Compresses the Foundation Work 

Most teams face a choice at the start of a data platform project. Build the foundation properly and accept that it takes time. Or skip it and move fast now, knowing you'll pay for it later. 

Datacoves is built around the idea that you shouldn't have to make that trade-off. 

It's an enterprise data engineering platform that runs inside your private cloud and comes with the foundation pre-built. Managed dbt and Airflow, a VS Code development environment your engineers can open on day one, CI/CD pipelines that enforce quality before anything reaches production, and an architecture built on best practices that your team inherits rather than invents. 

The conventions, the guardrails, the deployment workflows, the secrets management, the testing framework. None of that gets figured out after the fact. It's already there. 

That's what compresses the timeline. Not shortcuts. Not skipping steps. The foundation work is done, and your team starts from a position that most organizations spend a year trying to reach on their own. 

The result is a team that ships consistently from early on, data that business users trust because quality is enforced by the system rather than by people remembering to check, and a platform that gets easier to extend as it grows. 

Guitar Center onboarded in days. Johnson and Johnson described it as a framework accelerator. Those outcomes aren't the result of moving fast and fixing problems later. They're the result of starting with a foundation that didn't need to be fixed. 

Snowflake is a great warehouse. The teams that get the most out of it aren't the ones who bought it and figured out the rest later. They're the ones who treated the platform layer as part of the project from the start. The tool doesn't build the platform. That part is still your decision to make. 

Getting Started with dbt and Snowflake
5 mins read

Setting up dbt with Snowflake takes four steps: install the dbt-snowflake adapter with pip, configure a Snowflake user with key pair authentication, set up profiles.yml, and verify the connection with dbt debug. 

From there, add a few packages (dbt-coves, dbt_constraints, dbt_semantic_view), install SQLFluff and the right VS Code extensions, and you're ready to build. 

The full setup is straightforward for one developer. It gets expensive across a team, which is where managed dbt platforms come in. 

This guide walks through each step, the tooling that's worth adding, and when it makes sense to stop maintaining the setup yourself. 

What You Need to Run dbt with Snowflake 

Before you can run dbt against Snowflake, you need three things on your machine and one thing in Snowflake: 

On your machine: 

  • Python 3.9 or later. The dbt-snowflake adapter no longer supports older versions. Python 3.11 or 3.12 is a good default. 
  • Git. Required for dbt projects, version control, and CI/CD. If you don't already have it, follow GitHub's setup guide. 
  • VS Code. The setup below assumes it, and the extensions covered later in this guide build on it. 

In Snowflake: 

  • An account where you can create roles, databases, and warehouses, or admin support to do it for you. Do not use ACCOUNTADMIN for day-to-day dbt work. 

That's the short list. The next sections walk through each piece. 

Install Python, Git, and the dbt-snowflake Adapter 

Once Python, Git, and VS Code are installed, the only thing left to install locally is the dbt adapter for Snowflake. 

Use a virtual environment 

Install dbt inside a virtual environment, not against your system Python. A venv keeps your dbt dependencies isolated from other Python projects and makes upgrades safe: 

python -m venv .venv 
source .venv/bin/activate    # macOS/Linux 
.venv\Scripts\activate       # Windows 

Activate the venv every time you work on the project. Tools like uv or pyenv are also worth looking at if you're managing multiple Python versions across projects.

Install dbt-snowflake

Open a terminal and run:

pip install dbt-snowflake 

This installs dbt-core and the Snowflake adapter together. The adapter version pins a compatible dbt-core, so in most cases you don't need to specify versions yourself. 

If you need a specific version for a project that's pinned to an older release, install it explicitly: 

pip install dbt-snowflake==<version number> 

Confirm the install worked:

dbt --version 

You should see both dbt-core and dbt-snowflake listed.

Configure Your Snowflake Account for dbt  

Before dbt can connect to Snowflake, you need a Snowflake user with the right permissions, a role for that user to assume, a database where dbt can build models, and a warehouse for dbt to use as compute. You also need an authentication method. As of late 2025, that means key pair authentication, not a password. 

Create the Role, Database, and Warehouse  

For a typical dbt setup, create a dedicated role, database, and warehouse rather than reusing existing ones. This keeps dbt's footprint isolated and easy to govern. 

Run the following as a user with SECURITYADMIN privileges (or higher, but avoid ACCOUNTADMIN for day-to-day work): 

-- Create a warehouse for dbt compute 
create warehouse transforming 
  warehouse_size = 'xsmall' 
  auto_suspend = 60 
  auto_resume = true 
  initially_suspended = true; 
 
-- Create a database where dbt will build models in development 
create database analytics_dev; 
 
-- Create a role for dbt developers 
create role analyst; 
 
-- Grant ownership of the dev database to the role 
grant ownership on database analytics_dev to role analyst; 
 
-- Grant warehouse usage to the role 
grant usage on warehouse transforming to role analyst; 
 
-- Grant the role to your user 
grant role analyst to user your_username; 

When dbt runs, it creates a schema for each developer inside analytics_dev and uses the transforming warehouse for compute. Production deployments typically use a separate role, database, and warehouse, governed through CI/CD rather than developer accounts. 
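As a sketch of that separation (the object and user names here are illustrative, not a prescription), production might get its own database and role, granted only to the service user your CI/CD pipeline runs as: 

-- Production objects, managed through CI/CD rather than developer accounts 
create database analytics; 
create role transformer; 
grant ownership on database analytics to role transformer; 
grant usage on warehouse transforming to role transformer; 

-- Only the CI/CD service user gets the production role 
grant role transformer to user dbt_ci_user; 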

For a more comprehensive Snowflake permission model (read-only roles, environment-specific access, masking policies, RBAC at scale), see How to Configure Snowflake for dbt on the dbt blog. We'll also cover infrastructure-as-code options for managing this further down. 

Set Up Key Pair Authentication 

Key pair authentication is the correct default for connecting dbt to Snowflake. As of November 2025, Snowflake enforces MFA on username/password logins, which makes password authentication unworkable for any unattended dbt run. 

Step 1. Generate a key pair on your machine.

# Generate an unencrypted private key 
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt 
 
# Generate the matching public key 
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub 

Windows users: install OpenSSL via Git for Windows (which bundles it). 

For production or CI/CD environments, store the private key in a secrets manager rather than on developer machines. 

Step 2. Register the public key with your Snowflake user. 

In Snowflake, run:

alter user your_username set rsa_public_key='<paste the contents of rsa_key.pub here, without the BEGIN/END lines>'; 
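To confirm the key you registered matches the one on disk, you can compare fingerprints. Snowflake stores a SHA-256 fingerprint of the registered public key, visible as the RSA_PUBLIC_KEY_FP property in the output of desc user. A sketch, assuming the rsa_key.p8 generated in Step 1:

```shell
# Assumes rsa_key.p8 from Step 1 (the guarded line makes this runnable anywhere
# by generating a throwaway key if none exists)
[ -f rsa_key.p8 ] || openssl genrsa 2048 2>/dev/null | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt

# Fingerprint of the key's public half; compare it with the RSA_PUBLIC_KEY_FP
# property that `desc user your_username;` reports in Snowflake
echo "SHA256:$(openssl rsa -pubout -in rsa_key.p8 -outform DER 2>/dev/null | openssl dgst -sha256 -binary | openssl enc -base64)"
```

If the two fingerprints differ, the public key was registered against the wrong user or pasted with the BEGIN/END lines included.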

Step 3. Reference the private key from profiles.yml.

dbt supports either a path to the private key file or the key contents inline. We'll set this up in the next section. 

For SSO environments where browser-based authentication is acceptable for local development, externalbrowser is also supported, but it can't be used for unattended runs. For most teams, key pair auth is the consistent answer across local development, CI, and production.

| Method | Local Dev | CI/CD and Production | Notes |
| --- | --- | --- | --- |
| Key pair (recommended) | Yes | Yes | Consistent across all environments. Works without MFA prompts. |
| External browser (SSO) | Yes | No | Useful for ad-hoc development. Cannot run unattended. |
| Username and password | Limited | No | Snowflake now enforces MFA on these logins. Effectively dead for dbt. |
| OAuth | Yes | Yes (with extra setup) | Strong option for teams already using OAuth with their IdP. |

Configure dbt and Verify the Snowflake Connection 

With Snowflake configured, the next step is to point dbt at it. dbt reads connection details from a file called profiles.yml, which lives in your home directory at ~/.dbt/profiles.yml. Project-level Snowflake behavior (table types, query tags, warehouse overrides) lives in dbt_project.yml inside the project itself. 

Initialize the Project with dbt init

If you're starting from scratch, dbt init creates a new project and prompts you for connection details:

dbt init my_project 

If you're cloning an existing project, run dbt init from inside the cloned repo to set up your profiles.yml entry without overwriting the project files. 

The init flow asks for the database type, account identifier, user, authentication method, role, database, warehouse, schema, and threads. The result is a working profiles.yml entry that looks like this: 

my_project: 
  target: dev 
  outputs: 
    dev: 
      type: snowflake 
      account: abc12345.us-east-1 
      user: your_username 
      private_key_path: /Users/your_username/.snowflake/rsa_key.p8 
      role: analyst 
      database: analytics_dev 
      warehouse: transforming 
      schema: dbt_your_username 
      threads: 8 

A few notes: 

  • private_key_path points to wherever you saved the private key you generated. Use the absolute path. The ~/ shorthand isn't always reliable in profiles.yml. 
  • schema is the developer's personal schema. The convention dbt_<username> prevents developers from stepping on each other. 
  • threads controls how many models dbt builds in parallel. 8 is a reasonable starting point. 

If you maintain a project that other developers will clone, add a profile_template.yml at the project root. It pre-fills the fixed values (account, role, database, warehouse) and only prompts each developer for what's truly user-specific (their username, schema, threads). This saves real time across a team. 
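A sketch of what that template could look like for the setup in this guide (the fixed values are the illustrative ones used above; see the dbt docs on profile_template.yml for the full format): 

fixed: 
  type: snowflake 
  account: abc12345.us-east-1 
  role: analyst 
  database: analytics_dev 
  warehouse: transforming 
prompts: 
  user: 
    hint: "your Snowflake username" 
  private_key_path: 
    hint: "absolute path to your rsa_key.p8" 
  schema: 
    hint: "dbt_<your username>" 
  threads: 
    type: int 
    default: 8 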

Run dbt debug to Verify the Connection  

Before doing anything else, confirm dbt can connect to Snowflake:

dbt debug 

If everything is configured correctly, you'll see All checks passed! at the bottom of the output. If you get an error, the most common causes are: 

  • Wrong account identifier format (Snowflake account IDs vary by region and cloud). 
  • Public key not registered against the user, or registered with the BEGIN/END lines included. 
  • Role missing USAGE on the warehouse or OWNERSHIP on the database. 
  • Wrong private key path, or file permissions on the key that prevent it from being read. 

If you're stuck, the #db-snowflake channel on the dbt Community Slack is the fastest way to get unstuck. 

Useful profiles.yml Settings to Know

dbt init gives you a working baseline, but a few profiles.yml settings are worth knowing about once you start running dbt regularly: 

  • reuse_connections: true keeps Snowflake connections alive across queries, which speeds up runs noticeably and is especially helpful with SSO. 
  • client_session_keep_alive: true prevents Snowflake from timing out long sessions during big builds. 
  • query_tag sets a default tag on every query dbt issues. This makes it easy to filter dbt activity in QUERY_HISTORY (we'll cover model-level overrides in the next section). 
  • connect_retries and connect_timeout are worth tuning if you hit transient connection failures. 
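In profiles.yml, these settings sit alongside the connection fields shown earlier. A sketch with illustrative starting values, not recommendations for every project: 

    dev: 
      type: snowflake 
      # ...account, user, private_key_path, role, database, warehouse, schema as before... 
      reuse_connections: true 
      client_session_keep_alive: true 
      query_tag: dbt 
      connect_retries: 3 
      connect_timeout: 10 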

Full reference: dbt-snowflake profile configuration

Useful dbt_project.yml Settings for Snowflake

Where profiles.yml controls how dbt connects, dbt_project.yml controls how dbt builds against Snowflake. A few Snowflake-specific configs are worth knowing about: 

Transient tables. Snowflake transient tables skip Fail-safe storage, which reduces cost. dbt creates transient tables by default. To make a folder of models permanent (for example, models that need Time Travel beyond one day or Fail-safe protection): 

models: 
  my_project: 
    marts: 
      +transient: false 

Query tags at the model level. Set a default in profiles.yml and override per model or folder in dbt_project.yml: 

models: 
  my_project: 
    finance: 
      +query_tag: "finance_models" 

Copy grants on rebuild. When dbt rebuilds a table, grants on the previous table are dropped by default. To preserve them:

models: 
  my_project: 
    +copy_grants: true 

Warehouse override. Most models can run on a small warehouse, but a few heavy ones may need more compute. Override per model or folder rather than running everything on a large warehouse:

models: 
  my_project: 
    heavy_marts: 
      +snowflake_warehouse: "transforming_xl" 

This also works for tests, which is useful when you want lightweight tests on a smaller warehouse than your model builds. 
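For example, to pin tests to the smaller default warehouse while heavy models build on the larger one (a sketch; adjust the project and warehouse names to your setup): 

tests: 
  my_project: 
    +snowflake_warehouse: "transforming" 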

The full list of Snowflake-specific configs lives in the dbt Snowflake configurations reference. 

dbt Packages and Python Libraries Worth Adding

dbt is most useful when paired with the right packages and Python libraries. The list below isn't exhaustive, but each of these earns its place in a serious dbt-on-Snowflake project. 

| Package / Library | What It Does | When to Add It |
| --- | --- | --- |
| dbt-coves | CLI that generates staging models, source YAML, and property files from warehouse metadata | From day one. Saves hours of boilerplate on every project. |
| dbt_constraints | Turns dbt tests into actual Snowflake primary key, unique key, foreign key, and not-null constraints with RELY | Once you have a stable set of tests. Improves query optimization and data modeling tool support. |
| dbt_semantic_view | Adds a semantic_view materialization so Snowflake semantic views are managed in dbt | When you're using or planning to use Snowflake semantic views, Cortex Analyst, or Snowflake Intelligence. |
| SQLFluff | SQL linter that understands Jinja and dbt syntax | Before the codebase grows past a handful of models. Easier to start clean than retrofit. |
| dbt-checkpoint | Pre-commit hooks that validate model documentation, tests, and naming conventions | Before merging the first feature branch. Stops technical debt from compounding. |

dbt-coves (Datacoves)

dbt-coves is an open-source CLI tool maintained by Datacoves. It automates the tedious parts of dbt development that nobody enjoys doing by hand: generating source definitions, staging models, property files, and Airflow DAGs from your warehouse metadata. 

Install it with pip:

pip install dbt-coves 

Most teams use it for staging model generation. Point it at a source schema and it produces clean staging models, source YAML, and the matching property files in seconds. For analytics engineers who model dozens of source tables, this saves hours per project. 
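A sketch of that workflow, run from inside the dbt project. The database and schema names are illustrative, and flag names can vary between dbt-coves releases, so check dbt-coves generate sources --help for your version: 

# Generate source YAML and staging models for every table in a schema 
dbt-coves generate sources --database RAW --schemas COUNTRY_DATA 

# Generate or update property (YAML) files for existing models 
dbt-coves generate properties 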

dbt-coves also includes utilities for backing up Airbyte and Fivetran configurations, which is useful when you want your ingestion config to live in Git alongside your dbt models. 

dbt_constraints (Snowflake Labs)

dbt_constraints is a Snowflake Labs package that turns your existing dbt tests into actual database constraints. If you've already added unique, not_null, and relationships tests, this package will generate matching primary key, unique key, foreign key, and not-null constraints on Snowflake automatically. 

Add it to packages.yml: 

packages: 
  - package: Snowflake-Labs/dbt_constraints 
    version: [">=1.0.0", "<2.0.0"] 

Why bother, given that Snowflake doesn't enforce most constraints? 

  • Query performance. Snowflake's optimizer uses primary key, unique key, and foreign key constraints during query rewrite when they're set to RELY. dbt_constraints creates constraints with RELY automatically when the underlying test passes, and NORELY when it fails. The optimizer can use this for join elimination, which removes unnecessary tables from query plans. 
  • Data modeling tools. BI and modeling tools like DBeaver and Oracle SQL Developer Data Modeler can reverse-engineer accurate data model diagrams when constraints exist. Without constraints, those diagrams are guesswork. 
  • Documentation that's always in sync. The constraints in your warehouse match what dbt actually tests. There's no drift between "what the tests say" and "what the database knows." 
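For example, with tests like these in a model's YAML (model and column names are illustrative), dbt_constraints would create a primary key on order_id and a foreign key to dim_customers once the tests pass: 

models: 
  - name: fct_orders 
    columns: 
      - name: order_id 
        tests: 
          - unique 
          - not_null 
      - name: customer_id 
        tests: 
          - relationships: 
              to: ref('dim_customers') 
              field: customer_id 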

dbt_semantic_view (Snowflake Labs)

dbt_semantic_view is a newer Snowflake Labs package that adds a semantic_view materialization to dbt. It lets you define and version-control Snowflake's native semantic views the same way you manage models. 

Add it to packages.yml:

packages: 
  - package: Snowflake-Labs/dbt_semantic_view 
    version: [">=1.0.0", "<2.0.0"] 

A semantic view model looks like this:

{{ config(materialized='semantic_view') }} 
 
TABLES ( 
  orders AS {{ ref('fct_orders') }}, 
  customers AS {{ ref('dim_customers') }} 
) 
 
RELATIONSHIPS ( 
  orders_to_customers AS orders (customer_id) REFERENCES customers (customer_id) 
) 
 
DIMENSIONS ( 
  customers.region AS region, 
  orders.order_date AS order_date 
) 
 
METRICS ( 
  orders.total_revenue AS SUM(orders.amount), 
  orders.order_count AS COUNT(orders.order_id) 
) 

Once materialized, the semantic view is a real Snowflake object. It can be consumed by Cortex Analyst, Snowflake Intelligence, and any tool that queries Snowflake. Because the definition lives in your dbt project, metric logic gets the same Git history, peer review, and CI/CD as your transformations. 

This matters more than it sounds. Most semantic layers either live outside dbt (drift inevitable) or get reinvented in every BI tool (drift guaranteed). Defining the semantic layer in dbt and materializing it natively in Snowflake closes that gap. 

SQLFluff

SQLFluff is the de facto SQL linter for dbt. It enforces formatting and style rules across your project so reviewers can focus on logic, not whether someone used trailing commas or capitalized SQL keywords. 

Install it alongside dbt: 

pip install sqlfluff sqlfluff-templater-dbt 

The sqlfluff-templater-dbt plugin lets SQLFluff understand Jinja, refs, sources, and macros. Without it, the linter chokes on dbt syntax. Configure rules in a .sqlfluff file at the project root, and point the templater at your dbt project directory so it can find dbt_project.yml. 
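A minimal .sqlfluff to start from, assuming the project root is also the dbt project directory: 

[sqlfluff] 
dialect = snowflake 
templater = dbt 

[sqlfluff:templater:dbt] 
project_dir = . 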

Datacoves sponsors SQLFluff as part of its commitment to open-source dbt tooling. 

dbt-checkpoint (Datacoves)

dbt-checkpoint is a set of pre-commit hooks that validate dbt project quality before code is merged. It catches the things code review usually misses: a model without a description, a column that's documented in YAML but missing from the SQL, a source that's been added without tests. 

Install it as part of your pre-commit setup: 

pip install pre-commit 

Then add the dbt-checkpoint hooks to .pre-commit-config.yaml:

repos: 
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint 
    rev: v2.0.7  # Verify the latest released version of dbt-checkpoint 
    hooks: 
      - id: check-model-has-description 
      - id: check-model-columns-have-desc 
      - id: check-model-has-tests 
      - id: check-source-has-freshness 
      - id: check-script-has-no-table-name 

Run pre-commit install once and the hooks fire automatically on every commit. 

The point isn't to enforce every possible rule. It's to keep technical debt from accumulating before it has a chance to compound. Datacoves maintains dbt-checkpoint as part of the broader dbt ecosystem. 

For a broader look at testing strategy, see An Overview of Testing Options for dbt

VS Code Extensions That Make dbt on Snowflake Easier

VS Code is the default IDE for dbt development. A few extensions turn it from "a nice editor" into a productive dbt workspace. 

Snowflake VS Code Extension

The official Snowflake extension brings the Snowsight experience into VS Code. You can browse databases, run worksheets, view query results, and upload or download files from Snowflake stages, all without leaving the editor. 

For dbt developers, the most useful part is being able to run ad-hoc queries against your warehouse next to the model you're working on. No more flipping between the browser and your IDE every time you need to inspect a column or check a row count. 

Power User for dbt (formerly dbt Power User)

Power User for dbt is the most useful dbt extension available. It adds the things dbt should arguably ship with itself: 

  • Run a model, test, or full DAG with a click instead of typing the command. 
  • Preview the result of a model or any selected CTE inline. (contributed by Datacoves) 
  • Click through ref() and source() calls to jump to the underlying file. 
  • See the compiled SQL side-by-side with the Jinja source. (contributed by Datacoves) 
  • Visualize the lineage graph from a model. 

If you only install one extension, install this one. 

SQLFluff Extension

The SQLFluff VS Code extension wires the SQLFluff linter directly into the editor. Linting errors show up inline as you type, with hover descriptions that link to the SQLFluff docs. 

This is the difference between linting being a chore developers run occasionally and linting being something they fix as they write. The former gets ignored. The latter keeps the codebase clean. 

The extension reads from the same .sqlfluff config file that the CLI uses, so there's no duplicate setup. 

Bringing AI Into Your dbt on Snowflake Workflow 

A modern dbt-on-Snowflake AI workflow combines an in-IDE assistant (Power User for dbt, GitHub Copilot, Claude Code) with a Snowflake-native assistant (Snowflake Cortex CLI) and MCP servers that give the AI structured access to your dbt project and warehouse metadata. 

AI has moved past being a novelty in dbt development. Used well, it accelerates the work that doesn't need a human (writing tests, generating documentation, drafting models, explaining errors) and gives developers more time for the work that does (modeling decisions, business logic, architecture). 

A modern dbt-on-Snowflake workflow has a few good options. 

Snowflake Cortex CLI (CoCo). Snowflake's command-line AI assistant runs against your Snowflake account and works like Claude Code or other terminal-based coding assistants. It's particularly useful for dbt because it can find tables and columns, inspect schemas, and generate SQL grounded in your actual warehouse, not a generic LLM guess. 

Read more: Datacoves Expands Snowflake AI Data Cloud Support

Claude Code, GitHub Copilot, OpenAI Codex CLI, Gemini CLI. Each of these works inside VS Code or the terminal. Claude Code and Codex CLI are particularly strong for multi-step refactors across a dbt project. Copilot is hard to beat for inline suggestions. The right choice depends on what your organization already pays for and what data your security team is comfortable sending to which provider. 

MCP servers. Model Context Protocol servers let AI assistants interact with dbt projects, Snowflake, and other tools through a standardized interface. Snowflake and the broader community have shipped MCP servers. Pairing an MCP server with an AI assistant gives the model real awareness of warehouse metadata. 

The thing to avoid is treating AI as a separate workflow. The point is to integrate it into the same VS Code environment where developers already work, with credentials and access already configured. Asking developers to copy-paste between a chat window and their IDE is friction the team will route around within a week. 

This is one of the harder parts of running dbt on Snowflake at scale: keeping AI tooling consistent across developers, with the right credentials, the right MCP servers, and the right governance around what data the AI can see. Datacoves comes preconfigured with Claude Code, Snowflake Cortex CLI, GitHub Copilot, OpenAI Codex CLI, and Gemini CLI inside the in-browser VS Code environment, all working against your Snowflake account with no per-developer setup. For teams that want to standardize how AI shows up in dbt development, that's a meaningful head start. 

Managing Snowflake Infrastructure Alongside dbt

dbt manages objects inside Snowflake (tables, views, tests, documentation). It does not manage Snowflake itself. Roles, users, grants, warehouses, masking policies, row access policies, network policies, resource monitors, and databases all live outside dbt's scope and need a separate infrastructure-as-code tool. 

Most teams handle this with whatever combination of click-ops, Snowsight, and SQL scripts has accumulated over the years. That works until it doesn't. 

The point at which it stops working is usually predictable: 

  • A new analyst joins and needs the right access. Nobody can fully reconstruct what the previous analyst was granted. 
  • A masking policy needs to be applied consistently across thirty tables containing PII. Someone misses three of them. 
  • An audit asks who has OWNERSHIP on production schemas. The answer takes a week to assemble. 
  • A new environment (dev, QA, staging) needs to mirror production. The clone drifts within a sprint because grants are applied manually. 

The fix is to manage Snowflake infrastructure as code, the same way you manage dbt models. Define roles, grants, warehouses, and policies in version-controlled files. Apply changes through pull requests. Let CI/CD enforce that production matches what's in Git. 
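To make the idea concrete, here is a deliberately minimal sketch of what "grants as code" means: a declarative mapping, kept in Git, that CI/CD renders into GRANT statements and applies to Snowflake. This is not any particular tool's format; the role and object names are hypothetical, and real tools handle revocation, drift, and hierarchies as well.

```python
# Illustrative sketch only: declarative role -> grants mapping rendered to SQL.
# Role and object names are hypothetical.
ROLE_GRANTS = {
    "analyst": [
        ("USAGE", "DATABASE", "ANALYTICS"),
        ("USAGE", "SCHEMA", "ANALYTICS.MARTS"),
        ("SELECT", "ALL TABLES IN SCHEMA", "ANALYTICS.MARTS"),
    ],
}

def render_grants(role_grants: dict) -> list[str]:
    """Turn a declarative role -> grants mapping into GRANT statements."""
    statements = []
    for role, grants in role_grants.items():
        for privilege, object_type, object_name in grants:
            statements.append(
                f"GRANT {privilege} ON {object_type} {object_name} TO ROLE {role.upper()};"
            )
    return statements

if __name__ == "__main__":
    # In CI/CD these statements would be applied to Snowflake after the PR merges
    for stmt in render_grants(ROLE_GRANTS):
        print(stmt)
```

Because the mapping lives in a version-controlled file, every access change becomes a reviewable diff instead of an untracked click in Snowsight.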

Why Terraform isn't a great fit for Snowflake

Terraform is the obvious starting point, but it's the wrong tool for most Snowflake teams. Terraform was built for managing infrastructure across many cloud providers, with a state file as its source of truth. For Snowflake specifically, this creates real problems: 

  • The state file becomes a sync target instead of a record of intent. Drift between Snowflake and the state file happens often, and resolving it is painful. 
  • The Terraform DSL is unfamiliar territory for analytics engineers. Most data teams don't have full-time platform engineers who already speak Terraform. 
  • Snowflake-specific features (RBAC at scale, tag-based masking policies, row access policies) require contortions in Terraform that a Snowflake-native tool can express directly. 

Snowcap: Snowflake-native infrastructure as code

Snowcap is the Snowflake-native IaC tool Datacoves built and maintains as open source. It manages users, roles, grants, warehouses, masking policies, row access policies, and over 60 other Snowflake resource types using YAML or Python configuration. No state file. No DSL to learn. No abstraction layer between your config and Snowflake. 

Snowcap is opinionated where opinion matters most: 

  • RBAC at scale. Define role hierarchies and grants in YAML. Apply consistently across teams, projects, and environments. 
  • Tag-based masking policies. Tag a column once, apply a masking policy to every column with that tag automatically. 
  • Row access policies. Define them once, version them in Git, deploy them like any other Snowflake object. 
  • CI/CD-first. Every change is a pull request. Production state matches what's been merged. 

If dbt is the workshop where you build data products, Snowcap is the power tools that keep the workshop itself in good order. The two work side by side: Snowcap manages who can see what and where compute lives, dbt manages how the data gets transformed. 

For teams already running dbt with Snowflake, adding Snowcap is one of the highest-leverage moves available. It doesn't replace anything you have. It fills the gap that almost every dbt team has but pretends not to: governed, version-controlled, repeatable Snowflake infrastructure. 

When to Stop DIY and Move to Managed dbt

The setup in this guide works. Plenty of teams run it successfully. The honest question isn't whether you can do it yourself. It's whether you should, given what your team is trying to accomplish. 

Here's the pattern most data teams follow: 

At one or two developers, DIY is the right call. The setup is straightforward, the maintenance is low, and the team can iterate on conventions as they go. There's no good reason to add a managed platform at this stage. 

At three to five developers, the cracks start to show. Onboarding a new developer takes a week instead of a day because everyone's local environment is slightly different. Python versions drift. Someone's profiles.yml has a passphrase from 2024 that nobody can find. CI/CD is held together by a YAML file one engineer maintains. It still works, but real time is being lost to platform maintenance. 

At ten or more developers, DIY is expensive. Onboarding tax compounds. Upgrades require coordinating across the whole team. Secrets management becomes a real problem. Multiple dbt projects need governed dependencies. Production runs need an actual orchestrator, not a cron job. CI/CD pipelines need ownership. Someone is now spending a meaningful chunk of their week on platform work that has nothing to do with delivering data products. 

| Team Size | DIY Verdict | What Goes Wrong First |
| --- | --- | --- |
| 1-2 developers | DIY is the right call | Nothing. Setup is straightforward, maintenance is low. |
| 3-5 developers | DIY starts costing real time | Onboarding tax, drifting Python versions, fragile CI/CD. Related: CI/CD With dbt Slim CI. |
| 10+ developers | DIY is expensive | Platform maintenance becomes someone's part-time job. |
| Regulated industries (any size) | Managed platform with private cloud is usually required | SaaS dbt platforms fail security review. DIY adds months of platform engineering. |

For regulated industries, DIY runs into a different wall. Pharma, healthcare, financial services, and government workloads usually require private cloud deployment, strict identity controls, audit logging, and architectures that pass internal security review. SaaS dbt platforms are often a non-starter. DIY on Kubernetes is doable, but it pulls in months of platform engineering work before the data team writes a single model. 

The decision isn't really between "DIY" and "managed." It's between who builds and maintains the platform layer. Either your team does it, or someone else does. If platform engineering is your team's competitive advantage, build it yourself. If your team's competitive advantage is delivering data products, the platform layer is overhead. 

See also: dbt Deployment Options

What managed dbt solves

Managed dbt platforms (the category, not the marketing) handle the layer between dbt and the rest of your infrastructure. The good ones cover: 

  • A consistent in-browser or pre-configured VS Code environment, so every developer is on the same Python version, the same dbt version, and the same set of extensions from day one. 
  • Managed Airflow for orchestration, both a personal sandbox for development and a shared production environment. 
  • Pre-built CI/CD pipelines for dbt tests, SQL linting, governance checks, documentation, and deployment. 
  • Secrets management integrated with your existing vault. 
  • Private cloud deployment for teams that need data to stay inside their own network. 
  • Best-practice templates so new projects start with the right structure instead of inventing it. 

Datacoves is the managed dbt platform we build, and the Snowflake integration is one of our most common deployments. Teams running dbt on Snowflake get an end-to-end environment in their own cloud: managed dbt, managed Airflow, in-browser VS Code, CI/CD, governance, and AI tooling, all preconfigured and connected to their Snowflake account. 

For a side-by-side look at the trade-offs, see our comparison of dbt Core vs dbt Cloud

Final Thoughts

dbt and Snowflake is one of the most productive combinations in modern data engineering. The tools fit together, the community is active, and the path from "first model" to "production analytics" is well-trodden. That doesn't mean the path is short. 

The setup itself isn't the hard part. Installing the adapter, configuring authentication, writing profiles.yml, and running dbt debug is a one-afternoon exercise. The harder part is everything that comes after: keeping ten developers on the same Python version, governing who can do what in Snowflake, integrating AI without creating a mess, deciding which packages are worth their weight, and making the whole thing maintainable as the team grows. 

The tooling in this guide handles most of it. dbt-coves removes the boilerplate. dbt_constraints turns your tests into actual database constraints. dbt_semantic_view brings the semantic layer into your dbt project. SQLFluff and dbt-checkpoint keep code quality from drifting. Power User for dbt makes daily development faster. Snowcap fills the gap dbt was never meant to fill. 

Where it gets expensive is at scale. The setup that works for two developers doesn't scale to twenty without serious investment in the platform layer underneath. Either your team builds and maintains that layer, or you find a managed platform that does it for you. There's no third option that holds up over time. 

If you're running dbt on Snowflake today and the setup is starting to feel heavier than it should, book a free architecture review. We'll discuss your environment, show you where Datacoves fits, and tell you honestly whether it makes sense for where you are. 

Lightning-Fast Data Stack with dlt
5 mins read

A lean analytics stack built with dlt, DuckDB, DuckLake, and dbt delivers fast insights without the cost or complexity of a traditional cloud data warehouse. For teams prioritizing speed, simplicity, and control, this architecture provides a practical path from raw data to production-ready analytics.

In practice, teams run this stack using Datacoves to standardize environments, manage workflows, and apply production guardrails without adding operational overhead.

A lean analytics stack built with dlt, DuckDB, DuckLake, and dbt delivers fast, production-ready insights without the cost or complexity of a traditional cloud data warehouse.

The Lean Data Stack: Tools and Roles

A lean analytics stack works when each tool has a clear responsibility. In this architecture, ingestion, storage, and transformation are intentionally separated so the system stays fast, simple, and flexible.

  • dlt handles ingestion. It reliably loads raw data from APIs, files, and databases into DuckDB with minimal configuration and strong defaults.
  • DuckDB provides the analytical engine. It is fast, lightweight, and ideal for running analytical queries directly on local or cloud-backed data.
  • DuckLake defines the storage layer. It stores tables as Parquet files with centralized metadata, enabling a true lakehouse pattern without a heavyweight platform.
  • dbt manages transformations. It brings version control, testing, documentation, and repeatable builds to analytics workflows.

Together, these tools form a modern lakehouse-style stack without the operational cost of a traditional cloud data warehouse.

Setting Up DuckDB and DuckLake with MotherDuck

Running DuckDB locally is easy. Running it consistently across machines, environments, and teams is not. This is where MotherDuck matters.

MotherDuck provides a managed control plane for DuckDB and DuckLake, handling authentication, metadata coordination, and cloud-backed storage without changing how DuckDB works. You still query DuckDB. You just stop worrying about where it runs.

To get started:

  1. Create a MotherDuck account.
  2. In Settings → Integrations, generate an API token (MOTHERDUCK_TOKEN).
  3. Configure access to your object storage, such as S3, for DuckLake-managed tables.
  4. Export the token as an environment variable on your local machine.

This single token is used by dlt, DuckDB, and dbt to authenticate securely with MotherDuck. No additional credentials or service accounts are required.
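Assuming a bash- or zsh-style shell, step 4 is a one-liner; add it to your shell profile so it persists across sessions. The token value below is a placeholder for the token generated in MotherDuck Settings.

```shell
# Make the token available to dlt, DuckDB, and dbt.
# The value is a placeholder; paste your real token from MotherDuck Settings.
export MOTHERDUCK_TOKEN="<your-motherduck-token>"

# Quick sanity check that the variable is set
echo "${MOTHERDUCK_TOKEN:+MOTHERDUCK_TOKEN is set}"
```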

At this point, you have:

  • A DuckDB-compatible analytics engine
  • An open table format via DuckLake
  • Centralized metadata and storage
  • A setup that works the same locally and in production

That consistency is what makes the rest of the stack reliable.

Ingesting Data with dlt into DuckDB

In a lean data stack, ingestion should be reliable, repeatable, and boring. That is exactly what dlt is designed to do.

dlt loads raw data into DuckDB with strong defaults for schema handling, incremental loads, and metadata tracking. It removes the need for custom ingestion frameworks while remaining flexible enough for real-world data sources.

In this example, dlt ingests a CSV file and loads it into a DuckDB database hosted in MotherDuck. The same pattern works for APIs, databases, and file-based sources.

To keep dependencies lightweight and avoid manual environment setup, we use uv to run the ingestion script with inline dependencies.

pip install uv
touch us_populations.py
chmod +x us_populations.py

The script below uses dlt’s MotherDuck destination. Authentication is handled through the MOTHERDUCK_TOKEN environment variable, and data is written to a raw schema in DuckDB.

#!/usr/bin/env -S uv run
# /// script
# dependencies = [
#   "dlt[motherduck]==1.16.0",
#   "psutil",
#   "pandas",
#   "duckdb==1.3.0"
# ]
# ///

"""Loads a CSV file to MotherDuck"""
import dlt
import pandas as pd
# pipelines_dir is a helper from the Datacoves environment;
# outside Datacoves, replace it with a local path such as "./pipelines"
from utils.datacoves_utils import pipelines_dir

@dlt.resource(write_disposition="replace")
def us_population():
    url = "https://raw.githubusercontent.com/dataprofessor/dashboard-v3/master/data/us-population-2010-2019.csv"
    df = pd.read_csv(url)
    yield df

@dlt.source
def us_population_source():
    return [us_population()]

if __name__ == "__main__":
    # Configure MotherDuck destination with explicit credentials
    motherduck_destination = dlt.destinations.motherduck(
        destination_name="motherduck",
        credentials={
            "database": "raw",
            "motherduck_token": dlt.secrets.get("MOTHERDUCK_TOKEN")
        }
    )

    pipeline = dlt.pipeline(
        progress = "log",
        pipeline_name = "us_population_data",
        destination = motherduck_destination,
        pipelines_dir = pipelines_dir,

        # dataset_name is the target schema name in the "raw" database
        dataset_name="us_population"
    )

    load_info = pipeline.run([
        us_population_source()
    ])

    print(load_info)

Running the script loads the data into DuckDB:

./us_populations.py

At this point, raw data is available in DuckDB and ready for transformation. Ingestion is fully automated, reproducible, and versionable, without introducing a separate ingestion platform.

Transforming Data with dbt and DuckLake

Once raw data is loaded into DuckDB, transformations should follow the same disciplined workflow teams already use elsewhere. This is where dbt fits naturally.

dbt provides version-controlled models, testing, documentation, and repeatable builds. The difference in this stack is not how dbt works, but where tables are materialized.

By enabling DuckLake, dbt materializes tables as Parquet files with centralized metadata instead of opaque DuckDB-only files. This turns DuckDB into a true lakehouse engine while keeping the developer experience unchanged.

To get started, install dbt and the DuckDB adapter:

pip install dbt-core==1.10.17
pip install dbt-duckdb==1.10.0
dbt init

Next, configure your dbt profile to target DuckLake through MotherDuck:

default:
  outputs:
    dev:
      type: duckdb
      # This requires the environment var MOTHERDUCK_TOKEN to be set
      path: 'md:datacoves_ducklake'
      threads: 4
      schema: dev  # this will be the prefix used in the duckdb schema
      is_ducklake: true

  target: dev

This configuration does a few important things:

  • Authenticates using the MOTHERDUCK_TOKEN environment variable
  • Writes tables using DuckLake’s open format
  • Separates transformed data from raw ingestion
  • Keeps development and production workflows consistent

With this in place, dbt models behave exactly as expected. Models materialized as tables are stored in DuckLake, while views and ephemeral models remain lightweight and fast.
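As a sketch of what that looks like in practice, a first model over the data loaded earlier might be written as follows. The source name assumes the dataset_name and resource from the dlt script above and would be declared in a sources YAML file; the config block is standard dbt.

```sql
-- models/staging/stg_us_population.sql
-- Materialized as a table, so it lands as Parquet managed by DuckLake.
-- Source names assume the dlt script above (dataset_name="us_population").
{{ config(materialized='table') }}

select *
from {{ source('us_population', 'us_population') }}
```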

From here, teams can:

  • Add dbt tests for data quality
  • Generate documentation and lineage
  • Run transformations locally or in shared environments
  • Promote models to production without changing tooling

This is the key advantage of the stack: modern analytics engineering practices, without the overhead of a traditional warehouse.

When This Stack Makes Sense

This lean stack is not trying to replace every enterprise data warehouse. It is designed for teams that value speed, simplicity, and cost control over heavyweight infrastructure.

This approach works especially well when:

  • You want fast analytics without committing to a full cloud warehouse.
  • Your team prefers open, file-based storage over proprietary formats.
  • You are building prototypes, internal analytics, or domain-specific data products.
  • Cost predictability matters more than elastic, multi-tenant scale.
  • You want modern analytics engineering practices without platform sprawl.

The trade-offs are real and intentional. DuckDB and DuckLake excel at analytical workloads and developer productivity, but they are not designed for high-concurrency BI at massive scale. Teams with hundreds of dashboards and thousands of daily users may still need a traditional warehouse.

Where this stack shines is time to value. You can move from raw data to trusted analytics quickly, with minimal infrastructure, and without locking yourself into a platform that is expensive to unwind later.

In practice, many teams use this architecture as:

  • A lightweight production analytics stack
  • A proving ground before scaling to a larger warehouse
  • A cost-efficient alternative for departmental or embedded analytics

When paired with Datacoves, teams get the operational guardrails this stack needs to run reliably. Datacoves standardizes environments, integrates orchestration and CI/CD, and applies best practices so the simplicity of the stack does not turn into fragility over time.

Teams often run this stack with Datacoves to standardize environments, apply production guardrails, and avoid the operational drag of DIY platform management.

See it in action

If you want to see this stack running end to end, watch the Datacoves + MotherDuck webinar. It walks through ingestion with dlt, transformations with dbt and DuckLake, and how teams operationalize the workflow with orchestration and governance.

The session also covers:

  • When DuckDB and DuckLake work well in production
  • How to add orchestration with Airflow
  • How teams visualize results with Streamlit

Watch the full session here

New Features from the Databricks AI Summit 2025
5 mins read

The Databricks AI Summit 2025 revealed a major shift toward simpler, AI-ready, and governed data platforms. From no-code analytics to serverless OLTP and agentic workflows, the announcements show Databricks is building for a unified future.

In this post, we break down the six most impactful features announced at the summit and what they mean for the future of data teams.

1. Databricks One and Genie: Making Analytics Truly Accessible

Databricks One (currently in private preview) introduces a no-code analytics platform aimed at democratizing access to insights across the organization. Powered by Genie, users can now interact with business data through natural language Q&A, no SQL or dashboards required. By lowering the barrier to entry, tools like Genie can drive better, faster decision-making across all functions.

Datacoves Take: As with any AI we have used to date, a solid foundation is key. AI cannot compensate for ambiguous metrics or a lack of institutional knowledge. As we have noted before, there are real dangers in over-trusting AI, and those caveats still apply.

2. Lakebase: A Serverless Postgres for the Lakehouse

In a bold move, Databricks launched Lakebase, a Postgres-compatible, serverless OLTP database natively integrated into the lakehouse. Built atop the foundations laid by the NeonDB acquisition, Lakebase reimagines transactional workloads within the unified lakehouse architecture. This is more than just a database release; it’s a structural shift that brings transactional (OLTP) and analytical (OLAP) workloads together, unlocking powerful agentic and AI use cases without architectural sprawl. 

Datacoves Take: We see both Databricks and Snowflake integrating Postgres into their offerings. DuckLake is also demonstrating a simpler future for Iceberg catalogs. Postgres has a strong future ahead, and the unification of OLAP and OLTP seems certain.

3. Agent Bricks: From Prototype to Enterprise-Ready AI Agents

With the introduction of Agent Bricks, Databricks is making it easier to build, evaluate, and operationalize agents for AI-driven workflows. What sets this apart is the use of built-in “judges”: LLMs that automatically assess agent quality and performance. This moves agents from hackathon demos into the enterprise spotlight, giving teams a foundation to develop production-grade AI assistants grounded in company data and governance frameworks.

Datacoves Take: This looks interesting, and the key here still lies in having a strong data foundation with good processes. Reproducibility is also key. Testing and proving that the right actions are performed will be important for any organization implementing this feature.

4. Databricks Apps: Interfaces That Inherit Governance by Design

Databricks introduced Databricks Apps, allowing developers to build custom user interfaces that automatically respect Unity Catalog permissions and metadata. A standout demo showed glossary terms appearing inline inside Chrome, giving business users governed definitions directly in the tools they use every day. This bridges the gap between data consumers and governed metadata, making governance feel less like overhead and more like embedded intelligence.

Datacoves Take: Metadata and catalogs are important for AI, so we see both Databricks and Snowflake investing in this area. As with any of these changes, technology is not the only change needed in the organization. Change management is also important. Without proper stewardship, ownership, and review processes, apps can’t provide the experience promised.

5. Unity Catalog Enhancements: Open Governance at Scale

Unity Catalog took a major step forward at the Databricks AI Summit 2025, now supporting managed Apache Iceberg tables, cross-engine interoperability, and introducing Unity Catalog Metrics to define and track business logic across the organization.

This kind of standardization is critical for teams navigating increasingly complex data landscapes. By supporting both Iceberg and Delta formats, enabling two-way sync, and contributing to the open-source ecosystem, Unity Catalog is positioning itself as the true backbone for open, interoperable governance.

Datacoves Take: The Iceberg data format has the momentum behind it; now it is up to the platforms to enable true interoperability. Organizations expect a future where a table can be written to and read from any platform. DuckLake is also getting in the game, simplifying how metadata is managed and enabling multi-table transactions. It will be interesting to see if Unity and Polaris take some of the DuckLake learnings and integrate them in the next few years.

6. Forever-Free Tier and $100M AI Training Fund

In a community-building move, Databricks introduced a forever-free edition of the platform and committed $100 million toward AI and data training. This massive investment creates a pipeline of talent ready to use and govern AI responsibly. For organizations thinking long-term, this is a wake-up call: governance, security, and education need to scale with AI adoption, not follow behind.

Datacoves Take: This feels like a good way to get more people to try Databricks without a big commitment. Hopefully, competitors take note and do the same. This will benefit the entire data community.

Read the full post from Databricks here:
https://www.databricks.com/blog/summary-dais-2025-announcements-through-lens-games

What Data Leaders Must Do Next After Databricks AI Summit 2025

Democratizing Data Access Is Critical

With tools like Databricks One and Genie enabling no-code, natural language analytics, data leaders must prioritize making insights accessible beyond technical teams to drive faster, data-informed decisions at every level.

Simplify and Unify Data Architecture

Lakebase’s integration of transactional and analytical workloads signals a move toward simpler, more efficient data stacks. Leaders should rethink their architectures to reduce complexity and support real-time, AI-driven applications.

Operationalize AI Agents for Business Impact

Agent Bricks and built-in AI judges highlight the shift from experimental AI agents to production-ready, measurable workflows. Data leaders need to invest in frameworks and governance to safely scale AI agents across use cases.

Governance Must Span Formats and Engines

Unity Catalog’s expanded support for Iceberg, Delta, and cross-engine interoperability emphasizes the need for unified governance frameworks that handle diverse data formats while maintaining business logic and compliance.

Invest in Talent and Training to Keep Pace

The launch of a free tier and $100M training fund underscores the growing demand for skilled data and AI practitioners. Data leaders should plan for talent development and operational readiness to fully leverage evolving platforms.

The Road Ahead: Operationalizing AI the Datacoves Way

The Databricks AI Summit 2025 signals a fundamental shift: from scattered tools and isolated workflows to unified, governed, and AI-native platforms. It’s not just about building smarter systems; it’s about making those systems accessible, efficient, and scalable for the entire organization.

While these innovations are promising, putting them into practice takes more than vision; it requires infrastructure that balances speed, control, and usability.

That’s where Datacoves comes in.

Our platform accelerates the adoption of modern tools like dbt, Airflow, and emerging AI workflows, without the overhead of managing complex environments. We help teams operationalize best practices from day one, reducing total cost of ownership while enabling faster delivery, tighter governance, and AI readiness at scale. Datacoves supports Databricks, Snowflake, BigQuery, and any data platform with a dbt adapter. We believe in an open and interoperable future where tools are integrated without increasing vendor lock-in. Talk to us to find out more.

Want to learn more? Book a demo with Datacoves.

Snowflake Summit 2025
5 mins read

It is clear that Snowflake is positioning itself as an all-in-one platform—from data ingestion, to transformation, to AI. The announcements covered a wide range of topics, with AI mentioned over 60 times during the 2-hour keynote. While time will tell how much value organizations get from these features, one thing remains clear: a solid foundation and strong governance are essential to deliver on the promise of AI.

Snowflake Intelligence (Public Preview)

Conversational AI via natural language at ai.snowflake.com, powered by Anthropic/OpenAI LLMs and Cortex Agents, unifying insights across structured and unstructured data. Access is available through your account representative.  

Datacoves Take: Companies with strong governance—including proper data modeling, clear documentation, and high data quality—will benefit most from this feature. AI cannot solve foundational issues, and organizations that skip governance will struggle to realize its full potential.

Data Science Agent (Private Preview)

An AI companion for automating ML workflows—covering data prep, feature engineering, model training, and more.

Datacoves Take: This could be a valuable assistant for data scientists, augmenting rather than replacing their skills. As always, we'll be better able to assess its value once it's generally available.

Cortex AISQL (Public Preview)

Enables multimodal AI processing (like images, documents) within SQL syntax, plus enhanced Document AI and Cortex Search.

Datacoves Take: The potential here is exciting, especially for teams working with unstructured data. But given historical challenges with Document AI, we’ll be watching closely to see how this performs in real-world use cases.

AI Observability in Cortex AI (GA forthcoming)

No-code monitoring tools for generative AI apps, supporting LLMs from OpenAI (via Azure), Anthropic, Meta, Mistral, and others.

Datacoves Take: Observability and security are critical for LLM-based apps. We’re concerned that the current rush to AI could lead to technical debt and security risks. Organizations must establish monitoring and mitigation strategies now, before issues arise 12–18 months down the line.

Snowflake Openflow (GA on AWS)

Managed, extensible multimodal data ingestion service built on Apache NiFi with hundreds of connectors, simplifying ETL and change-data capture.

Datacoves Take: While this simplifies ingestion, GUI tools often hinder CI/CD and code reviews. We prefer code-first tools like DLT that align with modern software development practices. Note: Openflow requires additional AWS setup beyond Snowflake configuration.

dbt Projects on Snowflake (Public Preview)

Native dbt development, execution, and monitoring with Git integration and AI-assisted code in Snowsight Workspaces.

Datacoves Take: While this makes dbt more accessible for newcomers, it’s not a full replacement for the flexibility and power of VS Code. Our customers rely on VS Code not just for dbt, but also for Python ingestion development, managing security as code, orchestration pipelines, and more. Datacoves provides an integrated environment that supports all of this—and more. See this walkthrough for details: https://www.youtube.com/watch?v=w7C7OkmYPFs

Enhanced Apache Iceberg support (Public/Private Preview)

Read/write Iceberg tables via Open Catalog, dynamic pipelines, VARIANT support, and Merge-on-Read functionality.

Datacoves Take: Interoperability is key. Many of our customers use both Snowflake and Databricks, and Iceberg helps reduce vendor lock-in. Snowflake’s support for Iceberg with advanced features like VARIANT is a big step forward for the ecosystem.

Modern DevOps extensions

Custom Git URLs, Terraform provider now GA, and Python 3.9 support in Snowflake Notebooks.

Datacoves Take: Python 3.9 is a good start, but we’d like to see support for newer versions. With PyPi integration, teams must carefully vet packages to manage security risks. Datacoves offers guardrails to help organizations scale Python workflows safely.

Snowflake Semantic Views (Public Preview)

Define business metrics inside Snowflake for consistent, AI-friendly semantic modeling.

Datacoves Take: A semantic layer is only as good as the underlying data. Without solid governance, it becomes another failure point. Datacoves helps teams implement the foundations—testing, deployment, ownership—that make semantic layers effective.

Standard Warehouse Gen2 (GA)

Hardware and performance upgrades delivering ~2.1× faster analytics for updates, deletes, merges, and table scans.

Datacoves Take: Performance improvements are always welcome, especially when easy to adopt. Still, test carefully—these upgrades can increase costs, and in some cases existing warehouses may still be the better fit.

SnowConvert AI

Free, automated migration of legacy data warehouses, BI systems, and ETL pipelines with code conversion and validation.

Datacoves Take: These tools are intriguing, but migrating platforms is a chance to rethink your approach—not just lift and shift legacy baggage. Datacoves helps organizations modernize with intention.

Cortex Knowledge Extensions (GA soon)

Enrich native apps with real-time content from publishers like USA TODAY, AP, Stack Overflow, and CB Insights.

Datacoves Take: Powerful in theory, but only effective if your core data is clean. Before enrichment, organizations must resolve entities and ensure quality.

Sharing of Semantic Models (Private Preview)

Internal/external sharing of AI-ready datasets and models, with natural language access across providers.

Datacoves Take: Snowflake’s sharing capabilities are strong, but we see many organizations underutilizing them. Effective sharing starts with trust in the data—and that requires governance and clarity.

Agentic Native Apps Marketplace

Developers can build and monetize Snowflake-native, agent-driven apps using Cortex APIs.

Datacoves Take: Snowflake has long promoted its app marketplace, but adoption has been limited. We’ll be watching to see if the agentic model drives broader use.

Improvements to Native App Framework

Versioning, permissions, app observability, and compliance badging enhancements.

Datacoves Take: We’re glad to see Snowflake adopting more software engineering best practices—versioning, observability, and security are all essential for scale.

Snowflake Adaptive Compute (Private Preview)

Auto-scaling warehouses with intelligent routing for performance optimization without cost increases.

Datacoves Take: This feels like a move toward BigQuery’s simplicity model. We’ll wait to see how it performs at scale. As always, test before relying on this in production.

Horizon Catalog Interoperability & Copilot (Private Preview)

Enhanced governance across Iceberg tables, relational DBs, dashboards, with natural-language metadata assistance.

Datacoves Take: Governance is core to successful data strategy. While Horizon continues to improve, many teams already use mature catalogs. Datacoves focuses on integrating metadata, ownership, and lineage across tools—not locking you into one ecosystem.

Security enhancements

Trust Center updates, new MFA methods, password protections, and account-level security improvements.

Datacoves Take: The move to enforce MFA and support for Passkeys is a great step. Snowflake is making it easier to stay secure—now organizations must implement these features effectively.

Enhanced observability tools

Upgrades to Snowflake Trail, telemetry for Openflow, and debug/monitor tools for Snowpark containers and GenAI agents/apps.

Datacoves Take: Observability is critical. Many of our customers build their own monitoring to manage costs and data issues. With these improvements, Snowflake is catching up—and Datacoves complements this with pipeline-level observability, including Airflow and dbt.

Read the full post from Snowflake here:
https://www.snowflake.com/en/blog/announcements-snowflake-summit-2025/

The Hidden Cost of No-Code ETL
5 mins read

The Hidden Costs of No-Code ETL Tools: 10 Reasons They Don’t Scale

"It looked so easy in the demo…"
— Every data team, six months after adopting a drag-and-drop ETL tool

If you lead a data team, you’ve probably seen the pitch: Slick visuals. Drag-and-drop pipelines. "No code required." Everything sounds great — and you can’t wait to start adding value with data!

At first, it does seem like the perfect solution: non-technical folks can build pipelines, onboarding is fast, and your team ships results quickly.

But our time in the data community has revealed the same pattern over and over: What feels easy and intuitive early on becomes rigid, brittle, and painfully complex later.

Let’s explore why no code ETL tools can lead to serious headaches for your data preparation efforts.

What Is ETL (and Why It Matters)?

Before jumping into the why and the how, let’s start with the what.

When data is created in its source systems, it is rarely ready to be used for analysis as is. It almost always needs to be reshaped and transformed before downstream teams can gather insights from it. That is where ETL comes in. ETL stands for Extract, Transform, Load: the process of moving data from multiple sources, reshaping (transforming) it, and loading it into a system where it can be used for analysis.

At its core, ETL is about data preparation:

  • Extracting raw data from different systems
  • Transforming it — cleaning, standardizing, joining, and applying business logic
  • Loading the refined data into a centralized destination like a data warehouse

Without ETL, you’re stuck with messy, fragmented, and unreliable data. Good ETL enables better decisions, faster insights, and more trustworthy reporting. Think of ETL as the foundation that makes dashboards, analytics, data science, machine learning, GenAI, and data-driven decision-making possible.
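
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the source rows are hard-coded, a local SQLite file stands in for the warehouse, and all field names are invented for the example.

```python
# Toy end-to-end ETL run: hard-coded source rows, SQLite as the "warehouse".
import sqlite3

def extract():
    # Extract: raw rows as they arrive from a source system (messy on purpose).
    return [
        {"id": 1, "email": " Ana@Example.COM ", "amount": "19.99"},
        {"id": 2, "email": "bo@example.com",    "amount": "5.00"},
    ]

def transform(rows):
    # Transform: clean strings, standardize types, apply business rules.
    return [
        (r["id"], r["email"].strip().lower(), float(r["amount"]))
        for r in rows
    ]

def load(rows, db_path="warehouse.db"):
    # Load: write the refined rows into the analytics destination.
    with sqlite3.connect(db_path) as con:
        con.execute(
            "CREATE TABLE IF NOT EXISTS orders (id INT, email TEXT, amount REAL)"
        )
        con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

load(transform(extract()))
with sqlite3.connect("warehouse.db") as con:
    print(con.execute("SELECT * FROM orders ORDER BY id").fetchall())
    # [(1, 'ana@example.com', 19.99), (2, 'bo@example.com', 5.0)]
```

Real pipelines differ mainly in scale and plumbing: the extract step talks to APIs or databases, the load step targets a warehouse like Snowflake, but the shape of the work is the same.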

Data-driven decision making

Now the real question is: how do we get from raw data to insights? That is where tooling comes into the picture. At a very high level, we group tools into two categories: code-based and no-code/low-code. Let’s look at each in a little more detail.

What Are Code-Based ETL Tools?

Code-based ETL tools require analysts to write code to build and manage data pipelines. This is typically done in programming languages like SQL and Python, sometimes with specialized frameworks, like dbt, tailored for data workflows.

Instead of clicking through a UI, users define the extraction, transformation, and loading steps directly in code — giving them full control over how data moves, changes, and scales.

Common examples of code-based ETL tooling include dbt (data build tool), SQLMesh, Apache Airflow, and custom-built Python scripts designed to orchestrate complex workflows.

While code-based tools often come with a learning curve, they offer serious advantages:

  • Greater flexibility to handle complex business logic
  • Better scalability as data volumes and pipeline complexity grow
  • Stronger maintainability through practices like version control, testing, and modular development

Most importantly, code-based systems allow teams to treat pipelines like software, applying engineering best practices that make systems more reliable, auditable, and adaptable over time.
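
As a tiny illustration of what "pipelines like software" means in practice, here is a transformation written as an ordinary Python function; all column names and rules are invented for the example. Because it is just code, it lives in version control, can be reused by every pipeline that touches customer data, and is guarded by a unit test that CI runs before anything deploys.

```python
# A transformation as plain, testable software: one pure function holds the
# shared business rules, so every pipeline applies them the same way.
# Field names and rules are illustrative, not from any real system.
def standardize_customer(record: dict) -> dict:
    """Apply shared customer-cleaning rules once; reuse everywhere."""
    return {
        "customer_id": int(record["customer_id"]),
        "email": record["email"].strip().lower(),
        "country": record.get("country", "unknown").upper(),
    }

# A unit test guards the rules; CI runs it before code reaches production.
def test_standardize_customer():
    out = standardize_customer({"customer_id": "7", "email": " A@B.io "})
    assert out == {"customer_id": 7, "email": "a@b.io", "country": "UNKNOWN"}

test_standardize_customer()
```

Contrast this with a drag-and-drop transformation: the same rule would live inside a GUI step that cannot be diffed, unit tested, or imported by a second pipeline.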

Building and maintaining robust ETL pipelines with code requires up-front work to set up CI/CD and developers who understand SQL or Python. Because of this investment in expertise, some teams are tempted to explore whether the grass is greener on the other side with no-code or low-code ETL tools that promise faster results with less engineering complexity. No hard-to-understand code, just drag and drop via nice-looking UIs. This is certainly less intimidating than seeing a SQL query.

What Are No-Code ETL Tools?

As you might have already guessed, no-code ETL tools let users build data pipelines without writing code. Instead, they offer visual interfaces—typically drag-and-drop—that “simplify” the process of designing data workflows.

These tools aim to make data preparation accessible to a broader audience, reducing complexity by removing coding. They create the impression that you don't need skilled engineers to build and maintain complex pipelines, allowing users to define transformations through menus, flowcharts, and configuration panels—no technical background required.

However, this perceived simplicity is misleading. No-code platforms often lack essential software engineering practices such as version control, modularization, and comprehensive testing frameworks. This can lead to a buildup of technical debt, making systems harder to maintain and scale over time. As workflows become more complex, the initial ease of use can give way to a tangled web of dependencies and configurations, challenging to untangle without skilled engineering expertise. Additional staff is needed to maintain data quality, manage growing complexity, and prevent the platform from devolving into a disorganized state. Over time, team velocity decreases due to layers of configuration menus.

Popular no-code ETL tools include Matillion, Talend, Azure Data Factory (ADF), Informatica, and Alteryx. They promise minimal coding while supporting complex ETL operations. However, it's important to recognize that while these tools can accelerate initial development, they may introduce challenges in long-term maintenance and scalability.

To help explain why best-in-class organizations typically avoid no-code tools, we've come up with 10 reasons that highlight their limitations.

🔟 Reasons GUI-Based ETL Tools Don’t Scale

1. Version control is an afterthought

Most no-code tools claim Git support, but it's often limited to unreadable exports like JSON or XML. This makes collaboration clunky, audits painful, and coordinated development nearly impossible.

Bottom Line: Scaling a data team requires clean, auditable change management — not hidden files and guesswork.

2. Reusability is limited

Without true modular design, teams end up recreating the same logic across pipelines. Small changes become massive, tedious updates, introducing risk and wasting your data team’s time. $$$

Bottom Line: When your team duplicates effort, innovation slows down.

3. Debugging is frustrating

When something breaks, tracing the root cause is often confusing and slow. Error messages are vague, logs are buried, and troubleshooting feels like a scavenger hunt. Again, wasting your data team’s time.

Bottom Line: Operational complexity gets hidden behind a "simple" interface — until it’s too late and it starts costing you money.

4. Testing is nearly impossible

Most no-code tools make it difficult (or impossible) to automate testing. Without safeguards, small changes can ripple through your pipelines undetected. Users will notice it in their dashboards before your data teams have their morning coffee.

Bottom Line: If you can’t trust your pipelines, you can’t trust your dashboards or reports.
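
For contrast, code-based stacks make these safeguards routine; dbt, for example, ships generic not_null and unique tests you declare in one line each. A rough Python sketch of what those two checks do (the table contents here are invented for the example):

```python
# Rough sketch of two standard data-quality checks, analogous to dbt's
# not_null and unique generic tests. Run after every change so regressions
# fail in CI instead of surfacing in a dashboard.
def assert_not_null(rows, column):
    bad = [r for r in rows if r.get(column) is None]
    assert not bad, f"{len(bad)} null values in {column}"

def assert_unique(rows, column):
    seen = set()
    dupes = [r[column] for r in rows if r[column] in seen or seen.add(r[column])]
    assert not dupes, f"duplicate {column} values: {dupes}"

# Illustrative table: two orders, each with a customer.
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
]
assert_not_null(orders, "customer_id")
assert_unique(orders, "order_id")
print("all data tests passed")
```

The point is not the ten lines of Python; it is that in a code-based stack these checks are versioned, reviewed, and run automatically, while most GUI tools offer no equivalent hook.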

5. They eventually require code anyway

As requirements grow, "no-code" often becomes "some-code." But now you’re writing scripts inside a platform never designed for real software development. This leads to painful uphill battles to scale.

Bottom Line: You get the worst of both worlds: the pain of code, without the power of code.

6. Poor team collaboration

Drag-and-drop tools aren’t built for teamwork at scale. Versioning, branching, peer review, and deployment pipelines — the basics of team productivity — are often afterthoughts. This makes it difficult for your teams to onboard, develop and collaborate. Less innovation, less insights, and more money to deliver insights!

Bottom Line: Without true team collaboration, scaling people becomes as hard as scaling data.

7. Vendor lock-in is real

Your data might be portable, but the business logic that transforms it often isn't. Migrating away from a no-code tool can mean rebuilding your entire data stack from scratch. Want to switch tooling for best-in-class tools as the data space changes? Good luck. 

Bottom Line: Short-term convenience can turn into long-term captivity.

8. Performance problems sneak up on you

When your data volume grows, you often discover that what worked for a few million rows collapses under real scale. Because the platform abstracts how work is done, optimization is hard — and costly to fix later. Your data team will struggle to lower that bill more than they would with finely tuned, code-based tools.

Bottom Line: You can’t improve what you can’t control.

9. Developers don’t want to touch them

Great analysts prefer tools that allow precision, performance tuning, and innovation. If your environment frustrates them, you risk losing your most valuable technical talent. Onboarding new people is expensive; you want to keep and cultivate the talent you do have. 

Bottom Line: If your platform doesn’t attract builders, you’ll struggle to scale anything.

10. They trade long-term flexibility for short-term ease

No-code tools feel fast at the beginning. Setup is quick, results come fast, and early wins are easy to showcase. But as complexity inevitably grows, you’ll face rigid workflows, limited customization, and painful workarounds. These tools are built for simplicity, not flexibility, and that becomes a real problem when your needs evolve. Simple tasks like moving a few fields or renaming columns stay easy, but once you need complex business logic, large transformations, or multi-step workflows, it is a different matter. What once sped up delivery now slows it down, as teams waste time fighting platform limitations instead of building what the business needs.

Bottom Line: Early speed means little if you can’t sustain it. Scaling demands flexibility, not shortcuts.

Conclusion

No-code ETL tools often promise quick wins: rapid deployment, intuitive interfaces, and minimal coding. While these features can be appealing, especially for immediate needs, they can introduce challenges at scale.

As data complexity grows, the limitations of no-code solutions—such as difficulties in version control, limited reusability, and challenges in debugging—can lead to increased operational costs and hindered team efficiency. These factors not only strain resources but can also impact the quality and reliability of your data insights. 

It's important to assess whether a no-code ETL tool aligns with your long-term data strategy. Always consider the trade-offs between immediate convenience and future scalability. Engaging with your data team to understand their needs and the potential implications of tool choices can provide valuable insights. 

What has been your experience with no-code ETL tools? Have they met your expectations, or have you encountered unforeseen challenges?

What Is Microsoft Fabric?
5 mins read

There's a lot of buzz around Microsoft Fabric these days. Some people are all-in, singing its praises from the rooftops, while others are more skeptical, waving the "buyer beware" flag. After talking with the community and observing Fabric in action, we're leaning toward caution. Why? Well, like many things in the Microsoft ecosystem, it's a jack of all trades but a master of none. Many of the promises seem to be more marketing hype than substance, leaving you with "marketecture" instead of solid architecture. While the product has admirable, lofty goals, Microsoft has many wrinkles to iron out.

In this article, we'll dive into 10 reasons why Microsoft Fabric might not be the best fit for your organization in 2025. By examining both the promises and the current realities of Microsoft Fabric, we hope to equip you with the information needed to make an informed decision about its adoption.

What is Microsoft Fabric?

Microsoft Fabric is marketed as a unified, cloud-based data platform developed to streamline data management and analytics within organizations. Its goal is to integrate various Microsoft services into a single environment and to centralize and simplify data operations.

This means that Microsoft Fabric is positioning itself as an all-in-one analytics platform designed to handle a wide range of data-related tasks: data engineering, data integration, data warehousing, data science, real-time analytics, and business intelligence. A one-stop shop, if you will. By consolidating these functions, Fabric hopes to provide a seamless experience for organizations to manage, analyze, and gather insights from their data.

Core Components of Microsoft Fabric

  • OneLake: OneLake is the foundation of Microsoft Fabric, serving as a unified data lake that centralizes storage across Fabric services. It is built on Delta Lake technology and leverages Azure Blob Storage, similar to how Apache Iceberg is used for large-scale cloud data management.
  • Synapse Data Warehouse: Similar to Amazon Redshift, this provides storage and management for structured data. It supports SQL-based querying and analytics, aiming to facilitate data warehousing needs.
  • Synapse Data Engineering: A compute engine based on Apache Spark, similar to Databricks' offering. It is intended to support tasks such as data cleaning, transformation, and feature engineering.
  • Azure Data Factory: A tool for pipeline orchestration and data loading, which is also part of Synapse Data Engineering.
  • Synapse Data Science: Notebooks similar to Jupyter that run only on Azure Spark. It is designed to support data scientists in developing predictive analytics and AI solutions by leveraging Azure ML and Azure Spark services.
  • Synapse Real-Time Analytics: Enables the analysis of streaming data from various sources including Kafka, Kinesis, and CDC sources.
  • Power BI: A BI (business intelligence) tool, like Tableau, designed to create data visualizations and dashboards.

Fabric presents itself as an all-in-one solution, but is it really? Let’s break down where the marketing meets reality.

10 Reasons It’s Still Not the Right Choice in 2025

While Microsoft positions Fabric as an innovative step forward, much of it is clever marketing and a repackaging of existing tools. Here’s what’s claimed—and the reality behind these claims:

1. Fragmented User Experience, Not True Unification

Claim: Fabric combines multiple services into a seamless platform, aiming to unify and simplify workflows, reduce tool sprawl, and make collaboration easier with a one-stop shop.

Reality:

  • Rebranded Existing Services: Fabric is mainly a repackaging of existing Azure services under a new brand. For example, Fabric bundles Azure Data Factory (ADF) for pipeline orchestration, Azure Synapse Analytics for traditional data warehousing needs, and Azure Spark for distributed workloads. While there are some enhancements to Synapse to synchronize data from OneLake, the core functionalities remain largely unchanged. Power BI is also part of Fabric, and that tool has existed for years, as have notebooks under the Synapse Data Science umbrella.
  • Steep Learning Curve and Complexity: Fabric claims to create a unified experience that doesn’t exist in other platforms, but it just bundles a wide range of services—from data engineering to analytics—and introduces new concepts (like KQL, a proprietary query language used only in the Azure ecosystem). Some tools are geared to different user personas, such as ADF for data engineers and Power BI for business analysts, but to “connect” an end-to-end process, users need to interact with different tools. This can be overwhelming, particularly for teams without deep Microsoft expertise. Each tool has its own quirks, and even services with overlapping functionality don’t work exactly the same way to do the same thing. This complicates the learning process and reduces overall efficiency.

2. Performance Bottlenecks & Throttling Issues

Claim: Fabric offers a scalable and flexible platform.


Reality: In practice, managing scalability in Fabric can be difficult. Scaling isn’t a one‑click, all‑services solution—instead, it requires dedicated administrative intervention. For example, you often have to manually pause and un-pause capacity to save money, a process that is far from ideal if you’re aiming for automation. Although there are ways to automate these operations, setting up such automation is not straightforward. Additionally, scaling isn’t uniform across the board; each service or component must be configured individually, meaning that you must treat them on a case‑by‑case basis. This reality makes the promise of scalability and flexibility a challenge to realize without significant administrative overhead.  

3. Capacity-Based Pricing Creates Cost Uncertainty

Claim: Fabric offers predictable, cost-effective pricing.

Reality: While Fabric's pricing structure appears straightforward, several hidden costs and adoption challenges can impact overall expenses and efficiency:  

  • Cost Uncertainty: Microsoft Fabric uses a capacity-based pricing model that requires organizations to purchase predefined Capacity Units (CUs). Organizations need to carefully assess their workload requirements to optimize resource allocation and control expenses. Although a pay-as-you-go (PAYG) option is available, it often demands manual intervention or additional automation to adjust resources dynamically. This means organizations often need to overprovision compute power to avoid throttling, leading to inefficiencies and increased costs. The problem is that you pay for what you think you will use, in exchange for a roughly 40% discount. If you don’t use all of the capacity, that capacity is wasted. If you go over capacity, you can configure PAYG, but it’s billed at full price. Unlike true serverless solutions, you pay for allocated capacity regardless of actual usage. This isn’t the flexibility the cloud was intended to provide. 👎
  • Throttling and Performance Degradation: Exceeding purchased capacity can result in throttling, causing degraded performance. To prevent this, organizations might feel compelled to purchase higher capacity tiers, further escalating costs.
  • Visibility and Cost Management: Users have reported challenges in monitoring and predicting costs due to limited visibility into additional expenses. This lack of transparency necessitates careful monitoring and manual intervention to manage budgets effectively.  
  • Adoption and Training Time: It’s important to note that implementing Fabric requires significant time investment in training and adapting existing workflows. While this is the case with any new platform, Microsoft is notorious for complexity in their tooling and this can lead to longer adoption periods, during which productivity may temporarily decline.

All this to say that the pricing model is not good unless you can predict with great accuracy exactly how much you will spend every single day, and who can do that? Check out this article on the hidden cost of Fabric, which goes into detail with cost comparisons.
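
To make the trade-off concrete, here is some illustrative arithmetic. Every number below (the rates, unit count, and hours) is invented for the example; check Microsoft's current pricing for real figures.

```python
# Illustrative arithmetic for capacity-based pricing: reserved capacity is
# discounted (~40% in the text above) but billed whether used or not, while
# overage runs at the full pay-as-you-go rate. All numbers are made up.
full_rate = 1.00       # cost per capacity-unit-hour at the PAYG price
reserved_rate = 0.60   # same unit with the ~40% reservation discount
reserved_units = 100   # CUs committed up front
hours = 730            # roughly one month

def monthly_cost(used_units):
    reserved = reserved_units * reserved_rate * hours   # paid regardless
    overage = max(0, used_units - reserved_units) * full_rate * hours
    return reserved + overage

# Under-use: the reservation is billed in full anyway.
print(monthly_cost(60))   # 43800.0, same as using all 100 units
# Over-use: the extra 20 units are billed at the undiscounted rate.
print(monthly_cost(120))  # 58400.0
```

The shape of the curve is the point: below the commitment, cost is flat no matter how little you use; above it, each extra unit costs full price, which is why accurate forecasting matters so much under this model.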

4. Limited Compatibility with Non-Microsoft Tools

Claim: Fabric supports a wide range of data tools and integrations.

Reality: Fabric is built around tight integration with other Fabric services and Microsoft tools such as Office 365 and Power BI. For organizations that prefer a “best-of-breed” approach, or that rely on tools like Tableau, Looker, or open-source solutions like Lightdash, this can severely limit flexibility and complicate future migrations.

While third-party connections are possible, they don’t integrate as smoothly as those in the MS ecosystem like Power BI, potentially forcing organizations to switch tools just to make Fabric work.

5. Poor DataOps & CI/CD Support

Claim: Fabric simplifies automation and deployment for data teams by supporting modern DataOps workflows.


Reality: Despite some scripting support, many components remain heavily UI‑driven. This hinders full automation and integration with established best practices for CI/CD pipelines (e.g., using Terraform, dbt, or Airflow). Organizations that want to mature their data operations with agile DataOps practices find themselves forced into manual workarounds and struggle to integrate Fabric tools into their CI/CD processes. Unlike tools such as dbt, there is no built-in data quality or unit testing, so additional tools would need to be added to Fabric to achieve this functionality.

6. Security Gaps & Compliance Risks

Claim: Microsoft Fabric provides enterprise-grade security, compliance, and governance features.

Reality: While Microsoft Fabric offers robust security measures like data encryption, role-based access control, and compliance with various regulatory standards, there are some concerns organizations should consider.

One major complaint is that access permissions do not always persist consistently across Fabric services, leading to unintended data exposure.

For example, users can still retrieve restricted data from reports due to how Fabric handles permissions at the semantic model level. Even when specific data is excluded from a report, built-in features may allow users to access the data, creating compliance risks and potential unauthorized access. Read more: Zenity - Inherent Data Leakage in Microsoft Fabric.

While some of these security risks can be mitigated, they require additional configurations and ongoing monitoring, making management more complex than it should be. Ideally, these protections should be unified and work out of the box rather than requiring extra effort to lock down sensitive data.

7. Lack of Maturity & Changes that Disrupt Workflow

Claim: Fabric is presented as a mature, production-ready analytics platform.

Reality: The good news for Fabric is that it is still evolving. The bad news is, it's still evolving. That evolution impacts users in several ways:  

  • Frequent Updates and Unstable Workflows: Many features remain in preview, and regular updates can sometimes disrupt workflows or introduce unexpected issues. Users have noted that the platform’s UI/UX is continually changing, which can impact consistency in day-to-day operations. Just when you figure out how to do something, the buttons change. 😤
  • Limited Features: Several functionalities are still in preview or implementation is still in progress. For example, dynamic connection information, Key Vault integration for connections, and nested notebooks are not yet fully implemented. This restricts the platform’s applicability in scenarios that rely on these advanced features.
  • Bugs and Stability Issues: A range of known issues—from data pipeline failures to problems with Direct Lake connections—highlights the platform’s instability. These bugs can make Fabric unpredictable for mission-critical tasks. One user lost 3 months of work!

8. Black Box Automation & Limited Customization

Claim: Fabric automates many complex data processes to simplify workflows.

Reality: Fabric is heavy on abstractions and this can be a double‑edged sword. While at first it may appear to simplify things, these abstractions lead to a lack of visibility and control. When things go wrong it is hard to debug and it may be difficult to fine-tune performance or optimize costs.

For organizations that need deep visibility into query performance, workload scheduling, or resource allocation, Fabric lacks the granular control offered by competitors like Databricks or Snowflake.

9. Limited Resource Governance and Alerting

Claim: Fabric offers comprehensive resource governance and robust alerting mechanisms, enabling administrators to effectively manage and troubleshoot performance issues.  

Reality: Fabric currently lacks fine-grained resource governance features making it challenging for administrators to control resource consumption and mitigate issues like the "noisy neighbor" problem, where one service consumes disproportionate resources, affecting others.  

The platform's alerting mechanisms are also underdeveloped. While some basic alerting features exist, they often fail to identify which processes or users are causing issues, which can make debugging an absolute nightmare. For example, users have reported difficulty pinpointing the specific reports causing slowdowns due to limited visibility in the capacity metrics app. This lack of detailed alerting makes it hard for administrators to monitor and troubleshoot performance effectively, often forcing the adoption of third-party tools for more granular governance and alerting. In other words, not so all-in-one after all.

10. Missing Features & Gaps in Functionality

Claim: Fabric aims to be an all-in-one platform that covers every aspect of data management.  

Reality: Despite its broad ambitions, key features are missing such as:

  • Geographical Availability: Fabric's data warehousing does not support multiple geographies, which could be a constraint for global organizations seeking localized data storage and processing.  
  • Garbage Collection: Parquet files that are no longer needed are not automatically removed from storage, potentially leading to inefficient storage utilization.  

While these are just a couple of examples, it's important to note that missing features will compel users to seek third-party tools to fill the gaps, introducing additional complexity. Integrating external solutions is not always straightforward with Microsoft products and often introduces significant overhead. Alternatively, users will have to go without the features and build workarounds, which tends to cause issues down the road.

Conclusion

Microsoft Fabric promises a lot, but its current execution falls short. Instead of an innovative new platform, Fabric repackages existing services, often making things more complex rather than simpler.

That’s not to say Fabric won’t improve—Microsoft has the resources to refine the platform. But as of 2025, the downsides outweigh the benefits for many organizations.

If your company values flexibility, cost control, and seamless third-party integrations, Fabric may not be the best choice. There are more mature, well-integrated, and cost-effective alternatives that offer the same features without the Microsoft lock-in.

Time will tell if Fabric evolves into the powerhouse it aspires to be. For now, the smart move is to approach it with a healthy dose of skepticism.

👉 Before making a decision, thoroughly evaluate how Fabric fits into your data strategy. Need help assessing your options? Check out this data platform evaluation worksheet.  

Open source databases
5 mins read

SQL databases are great for organizing, storing, and retrieving the structured data essential to modern business operations. These databases use Structured Query Language (SQL), the gold-standard tool for managing and manipulating data, universally recognized for its reliability and robustness in handling complex queries and vast datasets.

SQL is so instrumental to database management that databases are often categorized based on their use of SQL. This has led to the distinction between SQL databases, which use Structured Query Language for managing data, and NoSQL databases, which do not rely on SQL and are designed for handling unstructured data and different data storage models. If you are looking to compare SQL databases or just want to deepen your understanding of these essential tools, this article is just for you.

What is an Open Source Database?  

Open source databases are software systems whose source code is publicly available for anyone to view, modify, and enhance. This article covers strictly open source SQL databases. Why? Because we believe they bring additional advantages that are reshaping the data management space. Unlike proprietary databases, which can be expensive and restrictive, open source databases are built on collaboration and innovation. This not only eliminates licensing fees but also fosters a rich environment of community-driven enhancements. Contributors from around the globe work to refine and evolve these databases, ensuring they are equipped to meet the evolving demands of the data landscape.

Why Use Open Source Databases?

Cost-effectiveness: Most open source databases are free to use, which can significantly reduce the total cost of ownership.

Flexibility and Customization: Users can modify the database software to meet their specific needs, a benefit not always available with proprietary software.

Community Support: Robust communities contribute to the development and security of these databases, often releasing updates and security patches faster than traditional software vendors.

OLTP vs OLAP

When selecting a database, it is important to determine your primary use case. Are you frequently creating, updating, or deleting data? Or do you need to analyze large volumes of archived data that doesn't change often? The answer should guide the type of database system you choose to implement.  

In this article we will be touching on OLTP and OLAP open source SQL databases. These databases are structured in different ways depending on what they prioritize: transactions, analytics, or a hybrid of the two.

What is OLTP?

OLTP, or Online Transaction Processing, databases are designed to manage high volumes of small transactions such as inserting, updating, and deleting small amounts of data. OLTP databases can handle real-time transactional tasks due to their emphasis on speed and reliability. Their schemas are highly normalized to reduce redundancy and optimize insert/update/delete performance. OLTP databases can be used for analytics, but this is not recommended, since databases better suited to analytical workloads exist.

Characteristics of OLTP:

  • Handles large numbers of transactions by many users.
  • Operations are typically simple (e.g., updating a record or retrieving specific record details).
  • Focus on quick query processing and maintaining data integrity in multi-access environments.
  • Data is highly normalized.
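To make normalization and small transactional writes concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names and data are hypothetical, and an in-memory SQLite database stands in for a real server-based OLTP engine such as PostgreSQL:

```python
import sqlite3

# In-memory database for illustration; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized schema: customer details live in one place, and orders
# reference them by key instead of duplicating them.
cur.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total       REAL NOT NULL
    );
""")

# A typical OLTP operation: a small write wrapped in a transaction
# so it either fully succeeds or fully rolls back.
with conn:
    cur.execute("INSERT INTO customers (name) VALUES (?)", ("Ada",))
    cur.execute(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        (cur.lastrowid, 19.99),
    )

cur.execute("SELECT COUNT(*) FROM orders")
print(cur.fetchone()[0])  # 1
```

The `with conn:` block gives the two inserts all-or-nothing semantics: if either fails, both roll back, which is exactly the integrity guarantee OLTP systems are built around.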

When to use OLTP?

Use OLTP if you are developing applications that require fast, reliable, and secure transaction processing. Common use cases include but are not limited to:  

E-commerce: Order placement, payment processing, customer profile management, and shopping cart updates.

Banking: Account transactions, loan processing, ATM operations, and fraud detection.

Customer Relationship Management (CRM): Tracking customer interactions, updating sales pipelines, managing customer support tickets, and monitoring marketing campaigns.

What is OLAP?

OLAP, or Online Analytical Processing, databases are designed to perform complex analyses and queries on large volumes of data. They are optimized for read-heavy scenarios where queries are often complicated and involve aggregations such as sums and averages across many datasets. OLAP databases are typically denormalized, which improves query performance but comes at the expense of storage space and slower update speeds.

Characteristics of OLAP:

  • Designed for analysis and reporting functions.
  • Queries are complex and involve large volumes of data.
  • Focus on maximizing query speed across large datasets.
  • Data may be denormalized to expedite query processing.
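The contrast shows up in the queries themselves. Below is a sketch of an OLAP-style aggregation over a deliberately denormalized table, again using Python's sqlite3 purely for illustration; the table and figures are made up, and a real analytical workload would run on a columnar engine like the ones covered later in this article:

```python
import sqlite3

# Hypothetical denormalized sales table: region and product names are
# stored inline, trading storage space for simpler, faster analytical reads.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE sales (
        region  TEXT,
        product TEXT,
        amount  REAL
    )
""")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "Widget", 120.0),
        ("North", "Gadget", 80.0),
        ("South", "Widget", 200.0),
    ],
)

# A typical OLAP query: an aggregation that scans the whole table,
# rather than the point lookups and small writes typical of OLTP.
cur.execute("""
    SELECT region, SUM(amount) AS total, AVG(amount) AS average
    FROM sales
    GROUP BY region
    ORDER BY region
""")
for region, total, average in cur.fetchall():
    print(region, total, average)
# North 200.0 100.0
# South 200.0 200.0
```

Because no joins are needed to recover the region or product names, the read stays simple; the cost is that updating a product name would mean touching every row that mentions it.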

When to use OLAP?

Use OLAP if you need to perform complex analysis on large datasets to gather insights and support decision making. Common use cases include but are not limited to:

Retail Sales Data Analysis: A retail chain consolidates nationwide sales data to analyze trends, product performance, and customer preferences.

Corporate Performance Monitoring: A multinational uses dashboards to track financial, human resources, and operational metrics for strategic decision-making.

Financial Analysis and Risk Management: A bank leverages an OLAP system for financial forecasting and risk analysis using complex data-driven calculations.

In practice, many businesses will use both types of systems: OLTP systems to handle day-to-day transactions and OLAP systems to analyze data accumulated from these transactions for business intelligence and reporting purposes.

Now that we are well versed in OLTP vs OLAP, let's dive into our open source databases!  

Open Source OLTP Databases

PostgreSQL

A row-oriented database, often considered the world’s most advanced open source database. PostgreSQL offers extensive features designed to handle a range of workloads from single machines to data warehouses or web services with many concurrent users.

Best Uses: Enterprise applications, complex queries, handling large volumes of data.

SQLite

SQLite is a popular choice for embedded database applications, being a self-contained, high-reliability, full-featured SQL database engine. It is a file-based database, which means it stores data in a file (or set of files) on disk rather than requiring a server-based backend. This approach makes it lightweight, portable, easy to use, and self-contained.

Best Uses: Mobile applications, small to medium-sized websites, and desktop applications.
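Since SQLite ships with Python's standard library, the file-based model is easy to demonstrate. In this sketch the file name is arbitrary, and a temporary directory stands in for wherever your application keeps its data:

```python
import os
import sqlite3
import tempfile

# SQLite stores the entire database in a single ordinary file;
# there is no server to install or start. The filename is arbitrary.
path = os.path.join(tempfile.mkdtemp(), "app.db")

with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE notes (body TEXT)")
    conn.execute("INSERT INTO notes VALUES ('hello')")

# Opening the same file again finds the data still there, because it
# was persisted to disk rather than held by a server process.
with sqlite3.connect(path) as conn:
    rows = conn.execute("SELECT body FROM notes").fetchall()
print(rows)  # [('hello',)]
```

Deploying the database amounts to copying one file, which is why SQLite is so common in mobile and desktop applications.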

MariaDB

A row-oriented relational database and offshoot of MySQL. MariaDB was created by the original developers of MySQL after concerns over MySQL's acquisition by Oracle. It is widely respected for its performance and robustness.

Best Uses: Web-based applications, cloud environments, or as a replacement for MySQL.

Firebird

Firebird is a flexible relational database offering many ANSI SQL standard features that run on Linux, Windows, and a variety of Unix platforms. This database can handle a hybrid approach of OLTP and OLAP due to its multi-generational architecture and because readers do not block writers when accessing the same data.

Best Uses: Small to medium enterprise applications, particularly where complex, customizable database systems are required.

Open Source OLAP Databases

ClickHouse

Known for its speed, ClickHouse is an open source, column-oriented database management system that excels at real-time query processing over large datasets. It makes use of data compression, disk-based storage, parallel processing on multiple cores, distributed processing across multiple servers, and more.

Best Uses: Real-time analytics and managing large volumes of data.

DuckDB

Similar to SQLite, DuckDB is an embedded, file-based database; however, DuckDB is column-oriented and designed to execute analytical SQL queries quickly and efficiently. It has no external dependencies, making it simple, efficient, and portable. Because DuckDB runs embedded within the host process, data transfer during analytics is extremely fast.

Best Uses: Analytical applications that require fast, in-process SQL querying capabilities.

StarRocks

StarRocks is a performance-oriented, columnar distributed data warehouse designed to handle real-time analytics. StarRocks also supports hybrid row-column storage. It is known for its blazing-fast massively parallel processing (MPP) abilities. Data can be ingested at a high speed and updated and deleted in real time making it perfect for real-time analytics on fresh data.

Best Uses: Real-time analytical processing on large-scale datasets.

Doris

Doris is an MPP-based, column-oriented data warehouse, aimed at providing high performance and real-time analytical processing. Doris can support highly concurrent point query scenarios and high-throughput complex analytic scenarios. Its high speed and ease of use despite working with large amounts of data make it a great option.

Best Uses: Real-time OLAP applications and scenarios demanding fast data processing and complex aggregation.

Trino

Even though Trino is not a database but rather a query engine that lets you query your databases, we felt it was a powerful addition to this open source list. Originally developed at Facebook and previously known as PrestoSQL, Trino is designed to query large data warehouses and big data systems rapidly. Since it excels at working with terabytes or petabytes of data, it is an alternative to tools such as Hive or Pig. However, it can also operate on traditional relational databases and other data sources such as Cassandra. One major benefit is that Trino lets you run a single query across different databases and data sources, a capability known as query federation.

Best Uses: Distributed SQL querying for big data solutions.

Citus

While Citus is not a separate open source database, we felt it was a good addition to the list because it is an extension to PostgreSQL that transforms your Postgres database into a distributed database, enabling it to scale horizontally.

Best Uses: Scalable PostgreSQL applications, especially those needing to handle multi-tenant applications and real-time analytics over large datasets.

Conclusion  

Open source SQL databases provide a variety of options for organizations and developers seeking flexible, cost-effective solutions for data management. Whether your needs are for handling large data sets, real-time analytics, or robust enterprise applications, there is likely an open source database out there for you.
