Datacoves blog

Learn more about dbt Core, ELT processes, DataOps,
modern data stacks, and team alignment by exploring our blog.
Enterprise Transformation Guide
5 mins read

dbt (data build tool) is a SQL-based transformation framework that turns raw data into trusted, analytics-ready datasets directly inside your data warehouse. It brings software engineering discipline to analytics: version control, automated testing, CI/CD, and auto-generated documentation. dbt handles the "T" in ELT. It does not extract, load, or move data.

What dbt Does: The Transformation Layer in ELT

dbt focuses exclusively on the transformation layer of ELT (Extract, Load, Transform). Unlike traditional ETL tools that handle the entire pipeline, dbt assumes data already exists in your warehouse. Ingestion tools like Informatica, Azure Data Factory, or Fivetran load the raw data. dbt transforms it into trusted, analytics-ready datasets.

A dbt project consists of SQL files called models. Each model is a SELECT statement that defines a transformation. When you run dbt, it compiles these models, resolves dependencies, and executes the SQL directly in your warehouse. The results materialize as tables or views. Data never leaves your warehouse.

Example: A Simple dbt Model (models/marts/orders_summary.sql)

SELECT
 customer_id,
 COUNT(*) AS total_orders,
 SUM(order_amount) AS lifetime_value,
 MIN(order_date) AS first_order_date
FROM {{ ref('stg_orders') }}
GROUP BY customer_id

The {{ref('stg_orders')}} syntax creates an explicit dependency. dbt uses these references to build a dependency graph (DAG) of your entire pipeline, ensuring models run in the correct order.
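
Because the graph is explicit, dbt's node selection syntax can target a model and everything downstream of it. A brief illustration using the model names from the example above (these are standard dbt CLI options):

dbt run --select stg_orders+        # run stg_orders and everything downstream of it
dbt test --select orders_summary    # run only the tests defined on orders_summary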

A graphic shows dbt builds a dependency graph

For large datasets, dbt supports incremental models that process only new or changed data. This keeps pipelines fast and warehouse costs controlled as data volumes grow. 
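
As a rough sketch, an incremental model adds a config block and an is_incremental() filter. The column names below (order_id, updated_at) are illustrative and assume the staging model carries a last-updated timestamp:

{{ config(materialized='incremental', unique_key='order_id') }}

SELECT
 order_id,
 customer_id,
 order_amount,
 order_date,
 updated_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
-- on incremental runs, only process rows newer than what already exists in this table
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}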

With dbt, teams can: 

  • Write transformations as version-controlled SQL 
  • Define explicit dependencies between models 
  • Enforce data quality with automated tests 
  • Generate documentation and lineage automatically 
  • Deploy changes safely using CI/CD workflows 
  • Trace issues back to specific commits 

dbt handles the "T" in ELT. It does not extract, load, or move data between systems. 

How dbt Fits the Enterprise Stack

Layer          | Role                      | Example Tools
Ingestion      | Extract and load raw data | Informatica, Azure Data Factory, Fivetran, dlt
Transformation | Apply business logic      | dbt
Orchestration  | Schedule and coordinate   | Airflow
Consumption    | Analyze and visualize     | Tableau, Power BI
A graphic shows where dbt fits in the enterprise data stack

What dbt Is Not 

Misaligned expectations are a primary cause of failed dbt implementations. Knowing what dbt does not do matters as much as knowing what it does.

dbt Is NOT                        | What to Use Instead
An ETL/ELT ingestion tool         | Informatica, Azure Data Factory, Fivetran, dlt, custom scripts
A scheduler or orchestrator       | Airflow, Dagster, Prefect, Control-M
A data warehouse                  | Snowflake, BigQuery, Redshift, Databricks, MS Fabric
A BI or reporting tool            | Looker, Tableau, Power BI, Qlik, Omni
A data catalog                    | Atlan, Alation, DataHub, Collibra (dbt generates metadata)
A fix for organizational problems | Governance frameworks, clear ownership, aligned incentives

This separation of concerns is intentional. By focusing exclusively on transformation, dbt allows enterprises to evolve their ingestion, orchestration, and visualization layers independently. You can swap Informatica for Azure Data Factory or migrate from Redshift to Snowflake without rewriting your business logic.

A common mistake: treating dbt as a silver bullet. 

dbt is a tool, not a strategy. Organizations with unclear data ownership, no governance framework, or misaligned incentives will not solve those problems by adopting dbt. They will simply have the same problems with versioned SQL.

For a deeper comparison, see dbt vs Airflow: Which data tool is best for your organization? 

Why Enterprises Standardize on dbt

More than 30,000 companies use dbt weekly, including JetBlue, HubSpot, Roche, J&J, Block, and Nasdaq (dbt Labs, 2024 State of Analytics Engineering)

Enterprise adoption of dbt has accelerated because it solves problems that emerge specifically at scale. Small teams can manage transformation logic in spreadsheets and ad hoc scripts. At enterprise scale, that approach creates compounding risk.

Who Uses dbt in Production 

dbt has moved well beyond startups into regulated, enterprise environments: 

  • Life Sciences: Roche, Johnson & Johnson (see how J&J modernized their data stack with dbt), and pharmaceutical companies with strict compliance requirements 
  • Financial Services: Block (formerly Square), Nasdaq, and major banks processing billions of transactions 
  • Technology: GitLab, HubSpot, and companies operating data platforms at massive scale 

These are not proof-of-concept deployments. These are production systems powering executive dashboards, regulatory reporting, and customer-facing analytics.

The Problem: Scattered Business Logic 

Without a standardized transformation layer, enterprise analytics fails in predictable ways: 

  • Business logic sprawls across BI tools, Python scripts, stored procedures, and ad hoc queries 
  • The same metric (revenue, active user, churn rate) gets defined differently by different teams 
  • Data quality issues surface in executive dashboards, not in development 
  • Changes to upstream data silently break downstream reports 
  • New analysts spend weeks understanding tribal knowledge before contributing 
  • Auditors cannot trace how reported numbers were calculated

Organizations report that 45% of analyst time is spent finding, understanding, and fixing data quality issues (Gartner Data Quality Market Survey, 2023)

The Solution: Transformation as Code 

dbt addresses these problems by treating transformation logic as production code:

Without dbt                        | With dbt
Business logic in dashboards       | Business logic in version-controlled SQL
Metric definitions vary by team    | Single source of truth in core models
Quality issues found in production | Automated tests catch issues in CI
Changes are risky and manual       | Changes reviewed and deployed via PR
Onboarding takes weeks             | Self-documenting codebase with lineage
Audit requires archaeology         | Full git history of every transformation
Scattered logic vs. governed transformation

The dbt Ecosystem 

One of the most underappreciated reasons enterprises adopt dbt is leverage. dbt is not just a transformation framework. It sits at the center of a broad ecosystem that reduces implementation risk and accelerates delivery.

dbt Packages 

dbt packages are reusable projects available at hub.getdbt.com. They provide pre-built tests, macros, and modeling patterns that let teams leverage proven approaches instead of building from scratch. 

Popular packages include: 

  • dbt-utils: Generic tests and utility macros used by most dbt projects 
  • dbt-expectations: Data quality testing inspired by Great Expectations 
  • dbt-audit-helper: Compare model results during refactoring 
  • Source-specific packages for HubSpot, Salesforce, Stripe, and dozens of other systems 

Using packages signals operational maturity. It reflects a preference for shared, tested patterns over bespoke solutions that create maintenance burden. Mature organizations also create internal packages they can share across teams to leverage learnings across the company.
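
As a sketch, adding a package is a two-step process: declare it in packages.yml and run dbt deps to install it. The version ranges below are illustrative; check hub.getdbt.com for current releases:

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

After dbt deps runs, the package's tests and macros can be referenced throughout the project, for example dbt_utils.unique_combination_of_columns in a model's YAML file.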

Integrations with Enterprise Tools 

dbt integrates with the broader data stack through its rich metadata (lineage, tests, documentation): 

  • Data Catalogs: Atlan, Alation, DataHub ingest dbt metadata for discovery and governance 
  • Data Observability: Monte Carlo, Bigeye, and Elementary use dbt context for smarter alerting 
  • BI and Semantic Layer: Looker, Tableau, and other semantic layers for consistent metrics 
  • Orchestration: Airflow, Dagster, and Prefect trigger and monitor dbt runs 
  • CI/CD: GitHub Actions, GitLab CI, Jenkins, Azure DevOps for automated testing and deployment 

Because dbt produces machine-readable metadata, it acts as a foundation that other tools build on. This makes dbt a natural anchor point for enterprise data platforms.
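
For example, every dbt compile or run writes a manifest.json artifact describing models, tests, and their dependencies, which is what catalogs and observability tools consume. A minimal sketch of reading it directly (assumes the file exists under the project's target/ directory):

# list each model and its upstream dependencies from dbt's manifest artifact
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

for unique_id, node in manifest["nodes"].items():
    if node["resource_type"] == "model":
        print(node["name"], "depends on", node["depends_on"]["nodes"])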

The dbt Community

The dbt Slack community has 100,000+ members sharing patterns, answering questions, and debugging issues (dbt Labs Community Stats, 2024)

For enterprises, community size matters because: 

  • New hires often already know dbt, reducing onboarding time and training costs 
  • Common problems have well-documented solutions and patterns 
  • Best practices are discovered and shared quickly across organizations 
  • It reduces reliance on vendor documentation or expensive consultants 

When you adopt dbt, you are not just adopting a tool. You are joining an ecosystem with momentum.

How dbt Works: The Development Workflow 

A typical dbt workflow follows software engineering practices familiar to any developer: 

  1. Write a model: Create a SQL file using SELECT statements and dbt's ref() function for dependencies. 
  2. Test locally: Run dbt run to execute models against a development schema. Run dbt test to validate data quality. 
  3. Document: Add descriptions to models and columns in YAML files. dbt generates a searchable documentation site automatically. 
  4. Submit for review: Open a pull request. CI pipelines compile models, run tests, and check for standards compliance. 
  5. Deploy to production: After approval, changes merge to main and deploy to production schemas via CD pipelines. 
  6. Orchestrate: Airflow (or another orchestrator) schedules dbt runs, coordinates with upstream ingestion, and handles retries. 

Example: Documenting and Testing a Model (models/marts/orders_summary.yml)

models:
  - name: orders_summary
    description: "Customer-level order aggregations"
    columns:
      - name: customer_id
        description: "Primary key from source system"
        tests:
          - unique
          - not_null
      - name: lifetime_value
        description: "Sum of all order amounts in USD"

What dbt Delivers for Enterprise Leaders

For executives and data leaders, dbt is less about SQL syntax and more about risk reduction and operational efficiency. 

Measurable Outcomes 

Organizations implementing dbt with proper DataOps practices report: 

  • Dramatic productivity gains (Gartner predicts DataOps-guided teams will be 10x more productive by 2026) 
  • Faster incident resolution through lineage-based root cause analysis (from hours to minutes) 
  • Shorter onboarding with a self-documenting codebase (vs. 3+ months industry average) 
  • Elimination of metric drift where teams report different numbers for the same KPI 
  • Audit-ready transformation history with full traceability to code changes

Governance and Compliance 

dbt supports enterprise governance requirements by making transformations explicit and auditable: 

  • Every transformation is version-controlled with full commit history 
  • Code review processes enforce four-eyes principles on data logic changes 
  • Lineage shows exactly how sensitive data flows through the pipeline 
  • Test results provide evidence of data quality controls for auditors

DIY vs. Managed: The Infrastructure Decision 

The question for enterprise leaders is not "Should we use dbt?" The question is "How do we operate dbt as production infrastructure?" 

dbt Core is open source, and many teams start by running it on a laptop. But open source looks free the way a free puppy looks free. The cost is not in the acquisition. The cost is in the care and feeding. 

For a detailed comparison, see Build vs Buy Analytics Platform: Hosting Open-Source Tools

The hard part is not installing dbt. The complexity comes from everything around it: 

  • Managing consistent environments across development, CI, and production 
  • Operating Airflow for orchestration and retry logic 
  • Handling secrets, credentials, and access controls 
  • Coordinating upgrades across dbt, Airflow, and dependencies 
  • Supporting dozens of developers working safely in parallel 

Building your own dbt platform is like wiring your own home: possible, but something very few teams should take on. Most enterprises find that building and maintaining this infrastructure becomes a distraction from their core mission of delivering data products. 

dbt delivers value when supported by clear architecture, testing standards, CI/CD automation, and a platform that enables teams to work safely at scale.

Skip the Infrastructure. Start Delivering.

Datacoves provides managed dbt and Airflow deployed in your private cloud, with pre-built CI/CD, VS Code environments, and best-practice architecture out of the box. Your data never leaves your network. No VPC peering required. 

Learn more about Managed dbt + Airflow  

Decision Checklist for Leaders

A graphic shows decision checklist for leaders

Before adopting or expanding dbt, leaders should ask: 

Is your transformation logic auditable? If business rules live in dashboards, stored procedures, or tribal knowledge, the answer is no. dbt makes every transformation visible, version-controlled, and traceable. 

Do your teams define metrics the same way? If "revenue" or "active user" means different things to different teams, you have metric drift. dbt centralizes definitions in code so everyone works from a single source of truth. 

Where do you find data quality issues? If problems surface in executive dashboards instead of in daily data quality checks, you lack automated testing. dbt runs tests on every build, catching issues before they reach end users. 

How long does onboarding take? If new analysts spend weeks decoding tribal knowledge, your codebase is not self-documenting. dbt generates documentation and lineage automatically from code. 

Who owns your infrastructure? Decide whether your engineers should be building platforms or building models. Operating dbt at scale requires CI/CD, orchestration, environments, and security. That work must live somewhere. 

Can you trace how a number was calculated? If auditors or regulators ask how a reported figure was derived, you need full lineage from source to dashboard. dbt provides that traceability by design.

The Bottom Line 

dbt has become the standard for enterprise data transformation because it makes business logic visible, testable, and auditable. But the tool alone is not the strategy. Organizations that treat dbt as production infrastructure, with proper orchestration, CI/CD, and governance, unlock its full value. Those who skip the foundation often find themselves rebuilding later.

Ready to skip the infrastructure complexity? See how Datacoves helps enterprises operate dbt at scale

Lightning-Fast Data Stack with dlt
5 mins read

A lean analytics stack built with dlt, DuckDB, DuckLake, and dbt delivers fast insights without the cost or complexity of a traditional cloud data warehouse. For teams prioritizing speed, simplicity, and control, this architecture provides a practical path from raw data to production-ready analytics.

In practice, teams run this stack using Datacoves to standardize environments, manage workflows, and apply production guardrails without adding operational overhead.

The Lean Data Stack: Tools and Roles

A lean analytics stack works when each tool has a clear responsibility. In this architecture, ingestion, storage, and transformation are intentionally separated so the system stays fast, simple, and flexible.

  • dlt handles ingestion. It reliably loads raw data from APIs, files, and databases into DuckDB with minimal configuration and strong defaults.
  • DuckDB provides the analytical engine. It is fast, lightweight, and ideal for running analytical queries directly on local or cloud-backed data.
  • DuckLake defines the storage layer. It stores tables as Parquet files with centralized metadata, enabling a true lakehouse pattern without a heavyweight platform.
  • dbt manages transformations. It brings version control, testing, documentation, and repeatable builds to analytics workflows.

Together, these tools form a modern lakehouse-style stack without the operational cost of a traditional cloud data warehouse.

Setting Up DuckDB and DuckLake with MotherDuck

Running DuckDB locally is easy. Running it consistently across machines, environments, and teams is not. This is where MotherDuck matters.

MotherDuck provides a managed control plane for DuckDB and DuckLake, handling authentication, metadata coordination, and cloud-backed storage without changing how DuckDB works. You still query DuckDB. You just stop worrying about where it runs.

To get started:

  1. Create a MotherDuck account.
  2. In Settings → Integrations, generate an API token (MOTHERDUCK_TOKEN).
  3. Configure access to your object storage, such as S3, for DuckLake-managed tables.
  4. Export the token as an environment variable on your local machine.

This single token is used by dlt, DuckDB, and dbt to authenticate securely with MotherDuck. No additional credentials or service accounts are required.
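
As a quick sanity check (a sketch, assuming MOTHERDUCK_TOKEN is exported in your shell), you can open a connection from Python and list the databases MotherDuck exposes:

# connect through MotherDuck; the md: path picks up MOTHERDUCK_TOKEN from the environment
import duckdb

con = duckdb.connect("md:")
print(con.sql("SHOW DATABASES").fetchall())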

At this point, you have:

  • A DuckDB-compatible analytics engine
  • An open table format via DuckLake
  • Centralized metadata and storage
  • A setup that works the same locally and in production

That consistency is what makes the rest of the stack reliable.

Ingesting Data with dlt into DuckDB

In a lean data stack, ingestion should be reliable, repeatable, and boring. That is exactly what dlt is designed to do.

dlt loads raw data into DuckDB with strong defaults for schema handling, incremental loads, and metadata tracking. It removes the need for custom ingestion frameworks while remaining flexible enough for real-world data sources.

In this example, dlt ingests a CSV file and loads it into a DuckDB database hosted in MotherDuck. The same pattern works for APIs, databases, and file-based sources.

To keep dependencies lightweight and avoid manual environment setup, we use uv to run the ingestion script with inline dependencies.

pip install uv
touch us_populations.py
chmod +x us_populations.py

The script below uses dlt’s MotherDuck destination. Authentication is handled through the MOTHERDUCK_TOKEN environment variable, and data is written to a raw schema in DuckDB.

#!/usr/bin/env -S uv run
# /// script
# dependencies = [
#   "dlt[motherduck]==1.16.0",
#   "psutil",
#   "pandas",
#   "duckdb==1.3.0"
# ]
# ///

"""Loads a CSV file to MotherDuck"""
import dlt
import pandas as pd
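# datacoves_utils is a project-local helper (not a PyPI dependency); pipelines_dir
# points dlt at a persistent working directory for pipeline state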
from utils.datacoves_utils import pipelines_dir

@dlt.resource(write_disposition="replace")
def us_population():
    url = "https://raw.githubusercontent.com/dataprofessor/dashboard-v3/master/data/us-population-2010-2019.csv"
    df = pd.read_csv(url)
    yield df

@dlt.source
def us_population_source():
    return [us_population()]

if __name__ == "__main__":
    # Configure MotherDuck destination with explicit credentials
    motherduck_destination = dlt.destinations.motherduck(
        destination_name="motherduck",
        credentials={
            "database": "raw",
            "motherduck_token": dlt.secrets.get("MOTHERDUCK_TOKEN")
        }
    )

    pipeline = dlt.pipeline(
        progress = "log",
        pipeline_name = "us_population_data",
        destination = motherduck_destination,
        pipelines_dir = pipelines_dir,

        # dataset_name is the target schema name in the "raw" database
        dataset_name="us_population"
    )

    load_info = pipeline.run([
        us_population_source()
    ])

    print(load_info)

Running the script loads the data into DuckDB:

./us_populations.py

At this point, raw data is available in DuckDB and ready for transformation. Ingestion is fully automated, reproducible, and versionable, without introducing a separate ingestion platform.

Transforming Data with dbt and DuckLake

Once raw data is loaded into DuckDB, transformations should follow the same disciplined workflow teams already use elsewhere. This is where dbt fits naturally.

dbt provides version-controlled models, testing, documentation, and repeatable builds. The difference in this stack is not how dbt works, but where tables are materialized.

By enabling DuckLake, dbt materializes tables as Parquet files with centralized metadata instead of opaque DuckDB-only files. This turns DuckDB into a true lakehouse engine while keeping the developer experience unchanged.

To get started, install dbt and the DuckDB adapter:

pip install dbt-core==1.10.17
pip install dbt-duckdb==1.10.0
dbt init

Next, configure your dbt profile to target DuckLake through MotherDuck:

default:
  outputs:
    dev:
      type: duckdb
      # This requires the environment var MOTHERDUCK_TOKEN to be set
      path: 'md:datacoves_ducklake'
      threads: 4
      schema: dev  # this will be the prefix used in the duckdb schema
      is_ducklake: true

  target: dev

This configuration does a few important things:

  • Authenticates using the MOTHERDUCK_TOKEN environment variable
  • Writes tables using DuckLake’s open format
  • Separates transformed data from raw ingestion
  • Keeps development and production workflows consistent

With this in place, dbt models behave exactly as expected. Models materialized as tables are stored in DuckLake, while views and ephemeral models remain lightweight and fast.
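
As a hedged sketch, the raw data loaded by dlt can be declared as a dbt source and selected in a staging model. The database and schema names follow the ingestion example above and assume the raw MotherDuck database is reachable from the DuckLake target:

# models/staging/_sources.yml
sources:
  - name: us_population
    database: raw
    schema: us_population
    tables:
      - name: us_population

-- models/staging/stg_us_population.sql
SELECT *
FROM {{ source('us_population', 'us_population') }}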

From here, teams can:

  • Add dbt tests for data quality
  • Generate documentation and lineage
  • Run transformations locally or in shared environments
  • Promote models to production without changing tooling

This is the key advantage of the stack: modern analytics engineering practices, without the overhead of a traditional warehouse.

When This Stack Makes Sense

This lean stack is not trying to replace every enterprise data warehouse. It is designed for teams that value speed, simplicity, and cost control over heavyweight infrastructure.

This approach works especially well when:

  • You want fast analytics without committing to a full cloud warehouse.
  • Your team prefers open, file-based storage over proprietary formats.
  • You are building prototypes, internal analytics, or domain-specific data products.
  • Cost predictability matters more than elastic, multi-tenant scale.
  • You want modern analytics engineering practices without platform sprawl.

The trade-offs are real and intentional. DuckDB and DuckLake excel at analytical workloads and developer productivity, but they are not designed for high-concurrency BI at massive scale. Teams with hundreds of dashboards and thousands of daily users may still need a traditional warehouse.

Where this stack shines is time to value. You can move from raw data to trusted analytics quickly, with minimal infrastructure, and without locking yourself into a platform that is expensive to unwind later.

In practice, many teams use this architecture as:

  • A lightweight production analytics stack
  • A proving ground before scaling to a larger warehouse
  • A cost-efficient alternative for departmental or embedded analytics

When paired with Datacoves, teams get the operational guardrails this stack needs to run reliably. Datacoves standardizes environments, integrates orchestration and CI/CD, and applies best practices so the simplicity of the stack does not turn into fragility over time.

Teams often run this stack with Datacoves to standardize environments, apply production guardrails, and avoid the operational drag of DIY platform management.

See it in action

If you want to see this stack running end to end, watch the Datacoves + MotherDuck webinar. It walks through ingestion with dlt, transformations with dbt and DuckLake, and how teams operationalize the workflow with orchestration and governance.

The session also covers:

  • When DuckDB and DuckLake work well in production
  • How to add orchestration with Airflow
  • How teams visualize results with Streamlit

Watch the full session here

Balancing innovation and risk in the dbt Fivetran era
5 mins read

The merger of dbt Labs and Fivetran (which we refer to as dbt Fivetran for simplicity) represents a new era in enterprise analytics. The combined company is expected to create a streamlined, end-to-end data workflow consolidating data ingestion, transformation, and activation with the stated goal of reducing operational overhead and accelerating delivery. Yet, at the dbt Coalesce conference in October 2025 and in ongoing conversations with data leaders, many are voicing concerns about price uncertainty, reduced flexibility, and the long-term future of dbt Core.

As enterprises evaluate the implications of this merger, understanding both the opportunities and risks is critical for making informed decisions about their organization's long-term analytics strategy.

In this article, you’ll learn: 

1. What benefits the dbt Fivetran merger could offer enterprise data teams

2. Key risks and lessons from past open-source acquisitions

3. How enterprises can manage risks and challenges 

4. Practical steps dbt Fivetran can take to address community anxiety

dbt Labs and Fivetran

Streamlined Data Stack: The Promised Benefits of the dbt Fivetran Merger 

For enterprise data teams, the dbt Fivetran merger may bring compelling opportunities: 

1. Integrated Analytics Stack:

The combination of ingestion, transformation, and activation (reverse ETL) processes may enhance onboarding by streamlining contract management, security evaluations, and user training. 

2. Resource Investment:

The merged company has the potential to speed up feature development across the data landscape. Open data standards like Iceberg could see increased adoption, fostering interoperability between platforms such as Snowflake and Databricks.

While these prospects are enticing, they are not guaranteed. The newly formed organization now faces the non-trivial task of merging various teams, including Fivetran, HVR (Oct 2021), Census (May 2025), SQLMesh/Tobiko (Sept 2025), and dbt Labs (Oct 2025). Successfully integrating their tools, development practices, and support functions will be crucial. To create a truly seamless, end-to-end platform, alignment of product roadmaps, engineering standards, and operational processes will be necessary. Enterprises should carefully assess the execution risks when considering the promised benefits of this merger, as these advantages hinge on Fivetran's ability to effectively integrate these technologies and teams.

Project using fusion
Image Credit - dbtlabs

The Future of dbt Core: Examining License Risk and the Rise of dbt Fusion 

The future openness and flexibility of dbt Core is being questioned, with significant consequences for enterprise data teams that rely on open-source tooling for agility, security, and control.

dbt’s rapid adoption, now exceeding 80,000 projects, was fueled by its permissive Apache License and a vibrant, collaborative community. This openness allowed organizations to deploy, customize, and extend dbt to fit their needs, and enabled companies like Datacoves to build complementary tools, sponsor open-source projects, and simplify enterprise data workflows. 

However, recent moves by dbt Labs, accelerated by the Fivetran merger, signal a natural evolution toward monetization and enterprise alignment:

1. Licensing agreement with Snowflake 

2. Rewriting dbt Core as dbt Fusion under a more restrictive ELv2 license 

3. Introducing a “freemium” model for the dbt VS Code Extension, limiting free use to 15 registered users per organization

Projects using Core
Image Credit - dbtlabs

While these steps are understandable from a business perspective, they introduce uncertainty and anxiety within the data community. The risk is that the balance between open innovation and commercial control could tip, raising understandable questions about long-term flexibility that enterprises have come to expect from dbt Core. 

dbt Labs and Fivetran have both stated that dbt Core's license would not change, and I believe them. The vast majority of dbt users are using dbt Core, and changing the license risks fragmentation and loss of goodwill in the community. The future vision for dbt is not dbt Core, but dbt Fusion. 

While I see a future for dbt Core, I don't feel the same about SQLMesh. There is little chance that the dbt Fivetran organization would continue to invest in two open-source projects. It is also unlikely that SQLMesh innovations would make their way into dbt Core, as that would directly compete with dbt Fusion.

Vendor Lock-in Lessons: What History Teaches About Open-Source License Changes (Terraform, ElasticSearch) 

Recent history offers important cautionary tales for enterprises. While these are not direct parallels, they are worth learning from: 

1. Terraform: A license change led to fragmentation and the creation of OpenTofu, eroding trust in the original steward. 

2. ElasticSearch: License restrictions resulted in the OpenSearch fork, dividing the community and increasing support risks. 

3. Redis and MongoDB: Similar license shifts caused forks or migrations to alternative solutions, increasing risk and migration costs.

For enterprise data leaders, these precedents highlight the dangers of vendor fragmentation, increased migration costs, and uncertainty around long-term support. When foundational tools become less open, organizations may face difficult decisions about adapting, migrating, or seeking alternatives. If you're considering your options, check out our Platform Evaluation Worksheet.

On the other hand, there are successful models where open-source projects and commercial offerings coexist and thrive: 

1. Airflow: Maintains a permissive license, with commercial providers offering managed services and enterprise features. 

2. GitLab, Spark, and Kafka: Each has built a sustainable business around a robust open-source core, monetizing through value-added services and features. 

These examples show that a healthy open-source core, supported by managed services and enterprise features, can benefit all stakeholders, provided the commitment to openness remains.

Enterprise Action Plan: 4 Strategies to Mitigate Consolidation Risks and Maintain Flexibility 

To navigate the evolving landscape, enterprises should: 

1. Monitor licensing and governance changes closely. 

2. Engage in community and governance discussions to advocate for transparency. 

3. Plan for contingencies, including potential migration or multi-vendor strategies. 

4. Diversify by avoiding over-reliance on a single vendor or platform.

Governance & Vendor Strategy 

Avoid Vendor Lock-In: 

1. Continue to leverage multiple tools for data ingestion and orchestration (e.g., Airflow) instead of relying solely on a single vendor’s stack. 

2. Why? This preserves your ability to adapt as technology and vendor priorities evolve. While tighter tool integration is a potential promise of consolidation, options exist to reduce the burden of a multi-tool architecture.

For instance, Datacoves is built to help enterprises maintain governance, reliability, and freedom of choice to deploy securely in their own network, specifically supporting multi-tool architectures and open standards to minimize vendor lock-in risk. 

Demand Roadmap Transparency: 

1. Engage with your vendors about their product direction and advocate for community-driven development. 

2. Why? Transparency helps align vendor decisions with your business needs and reduces the risk of disruptive surprises. 

Community Engagement 

Participate in Open-Source Communities: 

1. Contribute to and help maintain the open-source projects that underpin your data platform. 

2. Why? Active participation ensures your requirements are heard and helps sustain the projects you depend on. 

Attend and Sponsor Diverse Conferences: 

1. Support and participate in community-driven events (such as Airflow Summit) to foster innovation and avoid concentration of influence. 

2. Why? Exposure to a variety of perspectives leads to stronger solutions and a healthier ecosystem. 

Supporting Open Source 

Support OSS Creators Financially and Through Advocacy: 

1. Sponsor projects or directly support maintainers of critical open-source tools. 

2. Why? Sustainable funding and engagement are vital for the health and reliability of the open-source ecosystem. 

Encourage Openness and Diversity 

1. Champion Diversity in OSS Governance: Advocate for broad, meritocratic project leadership and a diverse contributor base. 

2. Why? Diverse stewardship drives innovation, resilience, and reduces the risk of any one entity dominating the project’s direction.

Long-term analytics success isn’t just about technology selection. It’s about actively shaping the ecosystem through strategic diversification, transparent vendor engagement, and meaningful support of open standards and communities. Enterprises that invest in these areas will be best equipped to thrive, no matter how the vendor landscape evolves.

Preserving Trust: How dbt Fivetran Can Maintain Community Confidence and Avoid Fragmentation

While both dbt Labs and Fivetran have stated that the dbt Core license would remain permissive, to preserve trust and innovation in the data community, dbt Fivetran should commit to neutral governance and open standards for dbt Core, ensuring it remains a true foundation for collaboration, not fragmentation. 

It is common knowledge that the dbt community has powered a remarkable flywheel of innovation, career growth, and ecosystem expansion. Disrupting this momentum risks technical fragmentation and loss of goodwill, outcomes that benefit no one in the analytics landscape. 

To maintain community trust and momentum, dbt Fivetran should:

1. Establish Neutral Governance:

Place dbt Core under independent oversight, where its roadmap is shaped by a diverse set of contributors, not just a single commercial entity. Projects like Iceberg have shown that broad-based governance sustains engagement and innovation, compared to more vendor-driven models like Delta Lake. 

2. Consider Neutral Stewardship Models:

One possible long-term approach that has been seen in projects like Iceberg and OpenTelemetry is to place an open-source core under neutral foundation governance (for example, the Linux Foundation or Apache Software Foundation).

While dbt Labs and Fivetran have both reaffirmed their commitment to keeping dbt Core open, exploring such models in the future could further strengthen community trust and ensure continued neutrality as the platform evolves.

3. Encourage Meritocratic Development: Empower a core team representing the broader community to guide dbt Core’s future. This approach minimizes the risk of forks and fragmentation and ensures that innovation is driven by real-world needs. 

4. Apply Lessons from MetricFlow: When dbt Labs acquired MetricFlow and changed its license to BSL, it led to further fragmentation in the semantic layer space. Now, with MetricFlow relicensed as Apache and governed by the Open Semantic Interchange (OSI) initiative (including dbt Labs, Snowflake, and Tableau), the project is positioned as a vendor-neutral standard. This kind of model should be considered for dbt Core as well.

Making these changes will have a direct impact on:

1. Technical teams: By ensuring continued access to an open, extensible framework, and reducing the risk of disruptive migration. 

2. Business leaders: By protecting investments in analytics workflows and minimizing vendor lock-in or unexpected costs. 

Solidifying dbt Core as a true open standard benefits the entire ecosystem, including dbt Fivetran, which is building its future, dbt Fusion, on this foundation. Taking these steps would not only calm community anxiety but also position dbt Fivetran as a trusted leader for the next era of enterprise analytics.

Conclusion: The Road Ahead for Enterprise Analytics 

The dbt Fivetran merger represents a defining moment for the modern data stack, promising streamlined workflows while simultaneously raising critical questions about vendor lock-in, open-source governance, and long-term flexibility. Successfully navigating this shift requires a proactive, diversified strategy, one that champions open standards and avoids over-reliance on any single vendor. Enterprises that invest in active community engagement and robust contingency planning will be best equipped to maintain control and unlock maximum value from their analytics platforms.

Maintain Flexibility with a Managed Platform 

If your organization is looking for a way to mitigate these risks and secure your workflows with enterprise-grade governance and multi-tool architecture, Datacoves offers a managed platform designed for maximum flexibility and control. For a deeper look, find out what Datacoves has to offer

Ready to take control of your data future? Contact us today to explore how Datacoves allows organizations to take control while still simplifying platform management and tool integration.

Why executives can’t ignore data orchestration
5 mins read

Data orchestration is the foundation that ensures every step in your data value chain runs in the correct order, with the right dependencies, and with full visibility. Without it, even the best tools such as dbt, Airflow, Snowflake, or your BI platform operate in silos. This disconnect creates delays, data fires, and unreliable insights.

For executives, data orchestration is not optional. It prevents fragmented workflows, reduces operational risk, and helps teams deliver trusted insights quickly and consistently. When orchestration is built into the data platform from the start, organizations eliminate hidden technical debt, scale more confidently, and avoid the costly rework that slows innovation.

In short, data orchestration is how modern data teams deliver reliable, end-to-end value without surprises.

In today’s fast-paced business environment, executives are under increased pressure to deliver quick wins and measurable results. However, one capability that is often overlooked is data orchestration.

This oversight can sabotage progress as the promise of data modernization efforts fails to deliver expected outcomes in terms of ROI and improved efficiencies.

In this article, we will explain what data orchestration is, the risks of not implementing proper data orchestration, and how executives benefit from end-to-end data orchestration.

What Is Data Orchestration? (Simple Definition for Executives)

Data orchestration ensures every step in your data value chain runs in the right order, with the right dependencies, and with full visibility.
An infographic of data orchestration practice

Data orchestration is the practice of coordinating all the steps in your organization’s data processes so they run smoothly, in the right order, and without surprises. Think of it as the conductor ensuring each instrument plays at the right time to create beautiful music.

Generating insights is a multi-tool process. What’s the problem with this setup? Each of these tools may include its own scheduler, and each runs in a silo. Even if an upstream step fails or is delayed, the subsequent steps will still run. This disconnect leads to surprises for executives expecting trusted insights. This, in turn, leads to delays and data fires, which are disruptive and inefficient for the organization. 

Imagine you are baking a chocolate cake. You would need a recipe, all the ingredients, and a functioning oven. However, you wouldn’t turn on the oven before buying the ingredients, and you wouldn’t mix the batter if your milk had spoiled. Not having someone orchestrate all the steps in the right sequence would lead to a disorganized process that is inefficient and wasteful. You also know not to continue if there is a critical issue, such as spoiled milk. 

Data orchestration solves the problem of having siloed tools by connecting all the steps in the data value chain. This way, if one step is delayed or fails, subsequent steps do not run. With a data orchestration tool, we can also notify someone to resolve the issue so they can act quickly, reducing fires and providing visibility to the entire process.

Key Components of Modern Data Orchestration

  • Coordinated Workflows: Makes sure all tools and teams work together without unnecessary manual steps.
  • Problem Detection: Identifies issues early so they don’t disrupt reporting or decision-making.
  • Clear Oversight: Gives executives and teams visibility into the data value chain across the organization.
  • Scalable Processes: Ensures your data operations can grow with your business without causing chaos.

Data Orchestration vs. ETL (Clear Distinction)

ETL (Extract, Transform, and Load) focuses on moving and transforming data, but data orchestration is about making sure everything happens in the right sequence across all tools and systems. It’s the difference between just having the pieces of a puzzle and putting them together into a clear picture.

Why Orchestration Matters: The Business Case

Faster, More Reliable Analytics Delivery

Without data orchestration, even the best tools operate in silos, creating delays, data fires, and unreliable insights.

Executives make many decisions but rarely have the time to dive into technical details. They delegate research and expect quick wins, which often leads to mixed messaging. Leaders want resilient, scalable, future-proof solutions, yet they also pressure teams to deliver “something now.” Vendors exploit this tension. They sell tools that solve one slice of the data value chain but rarely explain that their product won't fix the underlying fragmentation. Quick wins may ship, but the systemic problems remain.

Data orchestration removes this friction. When workflows are unified, adding steps to the data flow is straightforward, pipelines are predictable, and teams deliver high-quality data products faster and with far fewer surprises.

Reduced Firefighting and Operational Risk

A major Datacoves customer summarized the difference clearly:
“Before, we had many data fires disrupting the organization. Now issues still occur, but we catch them immediately and prevent bad data from reaching stakeholders.”

Without orchestration, each new tool adds another blind spot. Teams don’t see failures until they hit downstream systems or show up in dashboards. This reactive posture creates endless rework, late-night outages, and a reputation problem with stakeholders.

With orchestration, failures surface early. Dependencies, quality checks, and execution paths are clear. Teams prevent incidents instead of reacting to them.

Governance and Visibility Across the Data Lifecycle

Data orchestration isn’t just about automation; it’s about governance.
It ensures:

  • Clear ownership
  • Predictable workflows
  • Consistent development processes
  • End-to-end visibility across ingestion, transformation, analytics, and activation.

This visibility dramatically improves trust. Stakeholders no longer get “chocolate cake” made with spoiled milk. A new tool may bake faster, but if upstream data is broken, the final product is still compromised.

Orchestration ensures the entire value chain is healthy, not just one ingredient.

Supports Ingestion and dbt Scheduling with Airflow

Modern data teams rely heavily on tools like dbt and Airflow, but these tools do not magically align themselves. Without orchestration:

  • dbt jobs run inconsistently
  • Airflow DAGs are difficult to track across environments
  • Quality checks run out of order, or not at all

With orchestration in place, ingestion, dbt scheduling, and activation become reliable, governed, and transparent, ensuring every step runs at the right time, in the right order, with the right dependencies. Learn more in our guide on the difference between dbt Cloud and dbt Core.

For more details on how dbt schedules and runs models, see the official dbt documentation.

To learn how Airflow manages task dependencies and scheduling, visit the official Apache Airflow documentation.
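
As a simple illustration (a sketch with hypothetical paths and schedule, not a drop-in DAG), an Airflow DAG can chain ingestion and dbt so transformations only run when upstream loading succeeds:

# illustrative Airflow DAG: ingestion runs first, dbt only runs if it succeeds
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_analytics",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",   # once a day at 06:00
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_data",
        bash_command="python /opt/pipelines/load_sources.py",   # hypothetical ingestion script
    )
    transform = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt_project && dbt build",        # hypothetical dbt project path
    )

    ingest >> transform   # dbt waits for ingestion; an upstream failure stops the chain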

Data orchestration:

  • Bridges gaps across systems
  • Reduces hidden costs and technical debt
  • Provides end-to-end visibility
  • Prevents costly rework as you scale

The Cost of NOT Having Orchestration

It is tempting to postpone data orchestration until the weight of data problems makes it unavoidable. Even the best tools and talented teams can struggle without a clear orchestration strategy. When data processes aren’t coordinated, organizations face inefficiencies, errors, and lost opportunities.

Implementing data orchestration early reduces hidden technical debt, prevents rework, and helps teams deliver trusted insights faster.

Fragmented Tools Create Operational Inefficiencies

When data pipelines rely on multiple systems that don’t communicate well, teams spend extra time manually moving data, reconciling errors, and firefighting issues. This slows decision-making and increases operational costs.

Common symptoms of fragmented tools include:

  • Manual data movement
  • Frequent reconciliations
  • Increased firefighting
  • Slower decision-making

Quick Wins Without Orchestration Create Long-Term Pain

Many organizations focus on a “quick wins” approach, only to discover that the cost of moving fast is a long-term lack of agility. Quick wins may deliver immediate results, but they lead to technical debt, wasted spend, and fragile data processes that are hard to scale. A great example is Data Drive’s journey: before adding data orchestration, they had to debug each step of their disconnected process whenever issues occurred. Now it is clear where an issue originated, enabling them to resolve problems faster for their stakeholders. 

Costly Rework and Compounding Technical Debt

As organizations grow, the absence of orchestration forces teams to revisit and fix processes repeatedly. Embedding orchestration from the start avoids repeated firefighting, accelerates innovation, and makes scaling smoother. Improving one step alone cannot deliver the desired outcome, just like a single egg cannot make a cake. 

Limited Visibility Erodes Trust in Insights

Organizations without data orchestration are effectively flying blind. Disconnected processes run out of order, and issues are discovered by frustrated stakeholders. Resource-constrained data teams spend their time firefighting instead of delivering new insights. The result is delays in decision-making, higher operating costs, and an erosion of trust in data.

Data orchestration diagram showing integrated workflows across tools

Common Roadblocks and How to Avoid Them

If data orchestration is so important, why do organizations go without it? We often hear some common objections: 

Roadblock: Lack of Awareness

Many organizations have not heard of data orchestration, and tool vendors rarely highlight the need for it. It is often only after a painful experience that they realize how essential it is. 

Roadblock: It Will Add Complexity

It’s true that data orchestration adds another layer, but without it, you have disconnected, siloed processes. The real cost comes from chaos, not from coordination. 

Roadblock: Another Tool Will Make Things Harder

Vendor sprawl can indeed introduce additional risk. That’s why all-in-one platforms like Datacoves reduce integration overhead by bundling enterprise-grade orchestration, such as Airflow, without increasing vendor lock-in. Explore Datacoves’ Integrated Orchestration Platform

Roadblock: Orchestration Makes Processes More Complex

Data value chains are inherently complex, with multiple data sources, ingestion processes, transformations, and data consumers. Data orchestration does not introduce complexity; it provides visibility and control over this complexity. 

It may seem reasonable to postpone data orchestration in the short term. But every mature data organization, both large and small, eventually needs to scale. By building data orchestration into the data platform from the start, you set up your teams for success, reduce firefighting, and avoid costly and time-consuming rework. Most importantly, the business receives trustworthy insights faster. 

How to Implement Data Orchestration Successfully

Implementing data orchestration doesn’t have to be complicated. The key is to approach it strategically, ensuring that every process is aligned, visible, and scalable.

Step 1: Start with a Clear, Business-Aligned Plan

Begin by mapping your existing data processes and identifying where inefficiencies or risks exist. Knowing exactly how data flows across teams and tools allows you to prioritize the areas that will benefit most from orchestration.

Key outcomes:

  • Clear understanding of current workflows
  • Prioritized areas for improvement
  • Better alignment across teams

Step 2: Automate High-Value Workflows First

Focus first on automating repetitive and error-prone steps such as data collection, cleaning, and routing. Automation reduces manual effort, frees up your team for higher-value work, and ensures processes run consistently.

Key outcomes:

  • Reduced manual effort
  • More consistent execution
  • Teams freed for strategic work

Step 3: Build Cross-Pipeline Visibility and Monitoring

Implement dashboards or monitoring tools that provide executives and teams with real-time visibility into data flows. Early detection of errors prevents costly mistakes and increases confidence in the insights being delivered.

Key outcomes:

  • Faster error detection
  • Increased trust in insights
  • Smoother incident response

Step 4: Scale Gradually with Dependencies and Governance

Start small with high-impact processes and expand orchestration across more workflows over time. Scaling gradually ensures that teams adopt the changes effectively and that processes remain manageable as data volume grows.

Key outcomes:

  • More predictable scaling
  • Stable workflows as volume grows
  • Stronger process governance

Step 5: Choose Tools that Align with Ingestion & dbt Scheduling

Select tools that integrate well with your existing systems, and provide flexibility for future growth. Popular orchestration tools include dbt and Airflow, but the best choice depends on your organization’s specific workflows and needs. Explore how these capabilities come packaged in the Datacoves Platform Features overview.

Key outcomes:

  • Better tool compatibility
  • Lower integration overhead
  • Simpler long-term evolution of your stack

Top Benefits and ROI of Data Orchestration

Investing in data orchestration delivers tangible business value. Organizations that implement orchestration gain efficiency, reliability, and confidence in their decision-making.

Improved Efficiency Across Data & Analytics Teams

Data orchestration reduces manual work, prevents duplicated efforts, and streamlines processes. Teams can focus on higher-value initiatives instead of firefighting data issues.

More Reliable, Trustworthy Insights

With coordinated workflows and monitoring, executives and stakeholders can trust the data they rely on. Decisions are backed by accurate, timely, and actionable insights.

Reduced Operational Costs and Technical Debt

By embedding data orchestration early, organizations avoid expensive rework, reduce errors, and prevent the accumulation of technical debt from ad hoc solutions.

Faster Innovation and Scalable Growth

Data orchestration ensures that data pipelines scale smoothly as the organization grows. Teams can launch new analytics initiatives faster, confident that their underlying processes are robust and repeatable.

Enhanced Visibility Across the Data Lifecycle

Executives gain a clear view of the entire data lifecycle, enabling better oversight, risk management, and strategic planning.

Final Recommendation: Orchestration Is the Foundation, Not the Finish Line

Data orchestration should not be seen as a “nice to have” feature that can be postponed. Mature organizations understand that it is the foundation needed to deliver trusted insights faster. Without it, companies risk siloed tools, increased data firefighting, and eroded trust in both the data and the data team. With it, organizations gain visibility, agility, and confidence that the insights fueling decisions are accurate. 

The real question for strategic leaders is whether to try to piece together disconnected solutions, focusing only on short-term wins, or invest in data orchestration early and unlock the full potential of a connected ecosystem.

For executives, prioritizing data orchestration will mean fewer data fires, accelerated innovation, and an environment where trusted insights flow as reliably as the business demands. 

To see how orchestration is built into the Datacoves platform, visit our Integrated Orchestration page.

Don’t wait until complexity forces your hand. Your team deserves to move faster and fight fewer fires.

Book a personalized demo to see how data orchestration with Datacoves helps leaders unlock value from day one.

New Features from the Databricks AI Summit 2025
5 mins read

The Databricks AI Summit 2025 revealed a major shift toward simpler, AI-ready, and governed data platforms. From no-code analytics to serverless OLTP and agentic workflows, the announcements show Databricks is building for a unified future.

In this post, we break down the six most impactful features announced at the summit and what they mean for the future of data teams.

1. Databricks One and Genie: Making Analytics Truly Accessible

Databricks One (currently in private preview) introduces a no-code analytics platform aimed at democratizing access to insights across the organization. Powered by Genie, users can now interact with business data through natural language Q&A, no SQL or dashboards required. By lowering the barrier to entry, tools like Genie can drive better, faster decision-making across all functions.

Datacoves Take: As with any AI we have used to date, having a solid foundation is key. AI cannot make up for ambiguous metrics or a lack of knowledge. As we have mentioned, there are some dangers in trusting AI, and those caveats still apply.

Making Analytics Truly Accessible
Image credit

2. Lakebase: A Serverless Postgres for the Lakehouse

In a bold move, Databricks launched Lakebase, a Postgres-compatible, serverless OLTP database natively integrated into the lakehouse. Built atop the foundations laid by the NeonDB acquisition, Lakebase reimagines transactional workloads within the unified lakehouse architecture. This is more than just a database release; it’s a structural shift that brings transactional (OLTP) and analytical (OLAP) workloads together, unlocking powerful agentic and AI use cases without architectural sprawl. 

Datacoves Take: We see both Databricks and Snowflake integrating Postgres into their offerings. DuckLake is also demonstrating a simpler future for Iceberg catalogs. Postgres has a strong future ahead, and the unification of OLAP and OLTP seems certain.

A Serverless Postgres for the Lakehouse
Image credit

3. Agent Bricks: From Prototype to Enterprise-Ready AI Agents

With the introduction of Agent Bricks, Databricks is making it easier to build, evaluate, and operationalize agents for AI-driven workflows. What sets this apart is the use of built-in “judges” - LLMs that automatically assess agent quality and performance. This moves agents from hackathon demos into the enterprise spotlight, giving teams a foundation to develop production-grade AI assistants grounded in company data and governance frameworks.

Datacoves Take: This looks interesting, and the key here still lies in having a strong data foundation with good processes. Reproducibility is also key. Testing and proving that the right actions are performed will be important for any organization implementing this feature.

From Prototype to Enterprise-Ready AI Agents
Image credit

4. Databricks Apps: Interfaces That Inherit Governance by Design

Databricks introduced Databricks Apps, allowing developers to build custom user interfaces that automatically respect Unity Catalog permissions and metadata. A standout demo showed glossary terms appearing inline inside Chrome, giving business users governed definitions directly in the tools they use every day. This bridges the gap between data consumers and governed metadata, making governance feel less like overhead and more like embedded intelligence.

Datacoves Take: Metadata and catalogs are important for AI, so we see both Databricks and Snowflake investing in this area. As with any of these advances, technology is not the only thing that needs to change; change management matters too. Without proper stewardship, ownership, and review processes, apps can’t deliver the promised experience.

5. Unity Catalog Enhancements: Open Governance at Scale

Unity Catalog took a major step forward at the Databricks AI Summit 2025: it now supports managed Apache Iceberg tables and cross-engine interoperability, and it introduces Unity Catalog Metrics to define and track business logic across the organization.

This kind of standardization is critical for teams navigating increasingly complex data landscapes. By supporting both Iceberg and Delta formats, enabling two-way sync, and contributing to the open-source ecosystem, Unity Catalog is positioning itself as the true backbone for open, interoperable governance.

Datacoves Take: The Iceberg data format has the momentum behind it; now it is up to the platforms to enable true interoperability. Organizations expect a future where a table can be written and read from any platform. DuckLake is also getting in the game, simplifying how metadata is managed and enabling multi-table transactions. It will be interesting to see whether Unity and Polaris take some of the DuckLake learnings and integrate them over the next few years.

6. Forever-Free Tier and $100M AI Training Fund

In a community-building move, Databricks introduced a forever-free edition of the platform and committed $100 million toward AI and data training. This massive investment creates a pipeline of talent ready to use and govern AI responsibly. For organizations thinking long-term, this is a wake-up call: governance, security, and education need to scale with AI adoption, not follow behind.

Datacoves Take: This feels like a good way to get more people to try Databricks without a big commitment. Hopefully, competitors take note and do the same. This will benefit the entire data community.

Read the full post from Databricks here:
https://www.databricks.com/blog/summary-dais-2025-announcements-through-lens-games

What Data Leaders Must Do Next After Databricks AI Summit 2025

Democratizing Data Access Is Critical

With tools like Databricks One and Genie enabling no-code, natural language analytics, data leaders must prioritize making insights accessible beyond technical teams to drive faster, data-informed decisions at every level.

Simplify and Unify Data Architecture

Lakebase’s integration of transactional and analytical workloads signals a move toward simpler, more efficient data stacks. Leaders should rethink their architectures to reduce complexity and support real-time, AI-driven applications.

Operationalize AI Agents for Business Impact

Agent Bricks and built-in AI judges highlight the shift from experimental AI agents to production-ready, measurable workflows. Data leaders need to invest in frameworks and governance to safely scale AI agents across use cases.

Governance Must Span Formats and Engines

Unity Catalog’s expanded support for Iceberg, Delta, and cross-engine interoperability emphasizes the need for unified governance frameworks that handle diverse data formats while maintaining business logic and compliance.

Invest in Talent and Training to Keep Pace

The launch of a free tier and $100M training fund underscores the growing demand for skilled data and AI practitioners. Data leaders should plan for talent development and operational readiness to fully leverage evolving platforms.

The Road Ahead: Operationalizing AI the Datacoves Way

The Databricks AI Summit 2025 signals a fundamental shift: from scattered tools and isolated workflows to unified, governed, and AI-native platforms. It’s not just about building smarter systems; it’s about making those systems accessible, efficient, and scalable for the entire organization.

While these innovations are promising, putting them into practice takes more than vision; it requires infrastructure that balances speed, control, and usability.

That’s where Datacoves comes in.

Our platform accelerates the adoption of modern tools like dbt, Airflow, and emerging AI workflows, without the overhead of managing complex environments. We help teams operationalize best practices from day one, reducing total cost of ownership while enabling faster delivery, tighter governance, and AI readiness at scale. Datacoves supports Databricks, Snowflake, BigQuery, and any data platform with a dbt adapter. We believe in an open and interoperable future where tools are integrated without increasing vendor lock-in. Talk to us to find out more.

Want to learn more? Book a demo with Datacoves.

Hidden dangers of AI
5 mins read

Introduction

Large Language Models (LLMs) like ChatGPT and Claude are becoming common in modern data workflows. From writing SQL queries to summarizing dashboards, they offer speed and support across both technical and non-technical teams. But as organizations begin to rely more heavily on these tools, the risks start to surface.  

The dangers of AI are not in what it cannot do, but in what it does too confidently. LLMs are built to sound convincing, even when the information they generate is inaccurate or incomplete. In the context of data analytics, this can lead to hallucinated metrics, missed context, and decisions based on misleading outputs. Without human oversight, these issues can erode trust, waste time, and create costly setbacks. 

This article explores where LLMs tend to go wrong in analytics workflows and why human involvement remains essential. Drawing from current industry perspectives and real-world examples, we will look at how to use LLMs effectively without sacrificing accuracy, accountability, or completeness.

LLMs Are Impressive, But Not Always Right 

LLMs like ChatGPT and Claude are excellent at sounding smart. That’s the problem. 

They’re built to generate natural-sounding language, not truth. So, while they may give you a SQL query or dashboard summary in seconds, that doesn’t mean it’s accurate. In fact, many LLMs can hallucinate metrics, invent dimensions, or produce outputs that seem plausible but are completely wrong.  

You wouldn’t hand over your revenue targets to someone who “sounds confident,” right? So why are so many teams doing that with AI? 

Here's the challenge. Even experienced data professionals can be fooled. The more fluent the output, the easier it is to miss the flaws underneath. 

This isn’t just theory; Castor warns that “most AI-generated queries will work, but they won’t always be right.” That tiny gap between function and accuracy is where risk lives.

The Real Dangers of AI in Analytics 

If you’re leading a data-driven team or making decisions based on LLM-generated outputs, these are the real risks you need to watch out for.

1. Hallucinated Insights

LLMs can fabricate filters, columns, and logic that don’t exist. In the moment, you might not notice, but if those false insights inform a slide for a board meeting or a product decision, the damage is done.

Here is an example where ChatGPT was given an image and asked to point out the three differences between the two pictures, as illustrated below:

Before comparison

Here is the output from ChatGPT:  

Here are the 3 differences between the two images: 

  • Eyebrow: The character on the left has an eyebrow, while the one on the right does not. 
  • Shirt Stripe: The red shirt under the jacket has two white stripes on the left image, but only one stripe on the right image. 
  • Shoe Sole: The left image shows a visible gray sole on the right shoe, which is missing in the right image. 

Let me know if you want these highlighted visually!

When asked to highlight the differences, ChatGPT produced the following image:

After ChatGPT comparison

As you can see, ChatGPT skewed the information and exaggerated the differences in the image. Only one of the three (the shirt stripe) was correct; it missed the actual differences in the sock and the hair, and it even changed the author in the copyright notice!

2. No Understanding of Business Context

AI doesn’t know your KPIs, fiscal calendars, or market pressures. Business context still requires a human lens to interpret properly. Without that, you risk misreading what the data is trying to say.

3. It’s Hard to Verify

AI doesn’t give sources or confidence scores. You often can’t tell where a number came from. Secoda notes that teams still need to double-check model outputs before trusting them in critical workflows. 

4. Non-Technical Teams May Misuse It

One of the great things about LLMs is how accessible they are. But that also means anyone can generate analytics even if they don’t understand the data underneath. This creates a gap between surface-level insight and actual understanding. 

5. Too Much Automation Slows You Down

Over-relying on LLMs can create more cleanup work than if a skilled analyst had done the task from the start, and that cleanup slows teams down. As we noted in Modern Data Stack Acceleration, true speed comes from workflows designed with governance and best practices.

Why Human Oversight Still Matters 

If you’ve ever skimmed an LLM-generated response and thought, “That’s not quite right,” you already know the value of human oversight. 

AI is fast, but it doesn’t understand what matters to your stakeholders. It can’t distinguish between a seasonal dip and a business-critical loss. And it won’t ask follow-up questions when something looks off. 

Think of LLMs like smart interns. They can help you move faster, but they still need supervision. Your team’s expertise, your mental model of how data maps to outcomes, is irreplaceable. Tools like Datacoves embed governance throughout the data journey to ensure humans stay in the loop. Or as Scott Schlesinger says, AI is the accelerator, not the driver. It’s a reminder that human hands still need to stay on the wheel to ensure we’re heading in the right direction.

Use cases that add immediate value without introducing much risk 

Datacoves helps teams enforce best practices around documentation, version control, and human-in-the-loop development. By giving analysts structure and control, Datacoves makes it easier to integrate AI tools without losing trust or accountability. Here are some examples where Datacoves’ integrated GenAI can boost productivity while keeping you in control. 

  • Debugging errors: GenAI helps humans pinpoint code errors and suggest fixes, while human expertise ensures changes are correct and safe. 
  • Model testing: GenAI, along with MCP servers, can profile data and recommend tests to an experienced analytics engineer, who ensures the tests make sense for the business (see the sketch after this list). 
  • Automating documentation: GenAI can populate missing dbt yml files with column details and descriptions. This automation saves time, but human validation remains critical to ensure accuracy and context. 
  • Onboarding docs: AI can draft detailed onboarding guides with project structure, tools, and best practices, saving time for team leads, while allowing them to validate output for completeness and accuracy. 
  • Other Documentation: GenAI can accelerate generation of documentation for webinars or repos, but manual review is key to catch inaccuracies.
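
To make this concrete, below is a minimal sketch of the kind of test a GenAI assistant might propose after profiling order data. The file name, the stg_orders model, and the rule itself are illustrative, not prescriptive. In dbt, a singular test is simply a SQL file whose returned rows are reported as failures, so an analytics engineer can review the suggestion in a pull request like any other code before it reaches production.

Example: A GenAI-Suggested Singular Test (tests/assert_no_negative_order_amounts.sql)

-- Any rows returned by this query are reported as test failures by dbt.
-- Suggested by the assistant after profiling; reviewed by a human before merge.
SELECT
 order_id,
 order_amount
FROM {{ ref('stg_orders') }}
WHERE order_amount < 0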

How to Use LLMs Without Losing Control 

Want to keep the speed of LLMs and the trust of your team? Here’s how to make that balance work. 

  • Always review AI-generated queries and summaries before sharing 
  • Use LLMs to support analysts, not replace them 
  • Make sure the underlying data is reliable and well-documented 
  • Train business users to validate outputs before acting 
  • Create workflows where human review is built in

Conclusion 

AI can enhance analytics but should not be used blindly. LLMs bring speed, scale, and support, but without human oversight they can also introduce costly errors that undermine trust in decision-making. 

Human judgment remains essential. It provides the reliability, context, and accountability that AI alone cannot deliver. Confidence doesn’t equal correctness, and sounding right isn’t the same as being right. 

The best results come from collaboration between humans and AI. Use LLMs as powerful partners, not replacements. How is your team approaching this balance? Explore how platforms like Datacoves can help you create workflows that keep humans in the loop.

Snowflake Summit 2025
5 mins read

It is clear that Snowflake is positioning itself as an all-in-one platform—from data ingestion, to transformation, to AI. The announcements covered a wide range of topics, with AI mentioned over 60 times during the 2-hour keynote. While time will tell how much value organizations get from these features, one thing remains clear: a solid foundation and strong governance are essential to deliver on the promise of AI.

Snowflake Intelligence (Public Preview)

Conversational AI via natural language at ai.snowflake.com, powered by Anthropic/OpenAI LLMs and Cortex Agents, unifying insights across structured and unstructured data. Access is available through your account representative.  

Datacoves Take: Companies with strong governance—including proper data modeling, clear documentation, and high data quality—will benefit most from this feature. AI cannot solve foundational issues, and organizations that skip governance will struggle to realize its full potential.

Data Science Agent (Private Preview)

An AI companion for automating ML workflows—covering data prep, feature engineering, model training, and more.

Datacoves Take: This could be a valuable assistant for data scientists, augmenting rather than replacing their skills. As always, we'll be better able to assess its value once it's generally available.

Cortex AISQL (Public Preview)

Enables multimodal AI processing (like images, documents) within SQL syntax, plus enhanced Document AI and Cortex Search.

Datacoves Take: The potential here is exciting, especially for teams working with unstructured data. But given historical challenges with Document AI, we’ll be watching closely to see how this performs in real-world use cases.
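
For context, the “AI inside SQL” pattern that Cortex AISQL builds on is already visible in Snowflake’s existing Cortex LLM functions. The sketch below uses SNOWFLAKE.CORTEX.COMPLETE with an illustrative model name and table; AISQL extends the same idea to multimodal inputs like images and documents.

Example: Calling an LLM from SQL (illustrative table and model name)

SELECT
 ticket_id,
 SNOWFLAKE.CORTEX.COMPLETE(
   'mistral-large',
   'Summarize this support ticket in one sentence: ' || ticket_text
 ) AS ticket_summary
FROM support.tickets;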

AI Observability in Cortex AI (GA forthcoming)

No-code monitoring tools for generative AI apps, supporting LLMs from OpenAI (via Azure), Anthropic, Meta, Mistral, and others.

Datacoves Take: Observability and security are critical for LLM-based apps. We’re concerned that the current rush to AI could lead to technical debt and security risks. Organizations must establish monitoring and mitigation strategies now, before issues arise 12–18 months down the line.

Snowflake Openflow (GA on AWS)

Managed, extensible multimodal data ingestion service built on Apache NiFi with hundreds of connectors, simplifying ETL and change-data capture.

Datacoves Take: While this simplifies ingestion, GUI tools often hinder CI/CD and code reviews. We prefer code-first tools like dlt that align with modern software development practices. Note: Openflow requires additional AWS setup beyond Snowflake configuration.

dbt Projects on Snowflake (Public Preview)

Native dbt development, execution, monitoring with Git integration and AI-assisted code in Snowsight Workspaces.

Datacoves Take: While this makes dbt more accessible for newcomers, it’s not a full replacement for the flexibility and power of VS Code. Our customers rely on VS Code not just for dbt, but also for Python ingestion development, managing security as code, orchestration pipelines, and more. Datacoves provides an integrated environment that supports all of this—and more. See this walkthrough for details: https://www.youtube.com/watch?v=w7C7OkmYPFs

Enhanced Apache Iceberg support (Public/Private Preview)

Read/write Iceberg tables via Open Catalog, dynamic pipelines, VARIANT support, and Merge-on-Read functionality.

Datacoves Take: Interoperability is key. Many of our customers use both Snowflake and Databricks, and Iceberg helps reduce vendor lock-in. Snowflake’s support for Iceberg with advanced features like VARIANT is a big step forward for the ecosystem.

Modern DevOps extensions

Custom Git URLs, Terraform provider now GA, and Python 3.9 support in Snowflake Notebooks.

Datacoves Take: Python 3.9 is a good start, but we’d like to see support for newer versions. With PyPi integration, teams must carefully vet packages to manage security risks. Datacoves offers guardrails to help organizations scale Python workflows safely.

Snowflake Semantic Views (Public Preview)

Define business metrics inside Snowflake for consistent, AI-friendly semantic modeling.

Datacoves Take: A semantic layer is only as good as the underlying data. Without solid governance, it becomes another failure point. Datacoves helps teams implement the foundations—testing, deployment, ownership—that make semantic layers effective.

Standard Warehouse Gen2 (GA)

Hardware and performance upgrades delivering ~2.1× faster analytics for updates, deletes, merges, and table scans.

Datacoves Take: Performance improvements are always welcome, especially when easy to adopt. Still, test carefully—these upgrades can increase costs, and in some cases existing warehouses may still be the better fit.

SnowConvert AI

Free, automated migration of legacy data warehouses, BI systems, and ETL pipelines with code conversion and validation.

Datacoves Take: These tools are intriguing, but migrating platforms is a chance to rethink your approach—not just lift and shift legacy baggage. Datacoves helps organizations modernize with intention.

Cortex Knowledge Extensions (GA soon)

Enrich native apps with real-time content from publishers like USA TODAY, AP, Stack Overflow, and CB Insights.

Datacoves Take: Powerful in theory, but only effective if your core data is clean. Before enrichment, organizations must resolve entities and ensure quality.

Sharing of Semantic Models (Private Preview)

Internal/external sharing of AI-ready datasets and models, with natural language access across providers.

Datacoves Take: Snowflake’s sharing capabilities are strong, but we see many organizations underutilizing them. Effective sharing starts with trust in the data—and that requires governance and clarity.

Agentic Native Apps Marketplace

Developers can build and monetize Snowflake-native, agent-driven apps using Cortex APIs.

Datacoves Take: Snowflake has long promoted its app marketplace, but adoption has been limited. We’ll be watching to see if the agentic model drives broader use.

Improvements to Native App Framework

Versioning, permissions, app observability, and compliance badging enhancements.

Datacoves Take: We’re glad to see Snowflake adopting more software engineering best practices—versioning, observability, and security are all essential for scale.

Snowflake Adaptive Compute (Private Preview)

Auto-scaling warehouses with intelligent routing for performance optimization without cost increases.

Datacoves Take: This feels like a move toward BigQuery’s simplicity model. We’ll wait to see how it performs at scale. As always, test before relying on this in production.

Horizon Catalog Interoperability & Copilot (Private Preview)

Enhanced governance across Iceberg tables, relational databases, and dashboards, with natural-language metadata assistance.

Datacoves Take: Governance is core to successful data strategy. While Horizon continues to improve, many teams already use mature catalogs. Datacoves focuses on integrating metadata, ownership, and lineage across tools—not locking you into one ecosystem.

Security enhancements

Trust Center updates, new MFA methods, password protections, and account-level security improvements.

Datacoves Take: The move to enforce MFA and support for Passkeys is a great step. Snowflake is making it easier to stay secure—now organizations must implement these features effectively.

Enhanced observability tools

Upgrades to Snowflake Trail, telemetry for Openflow, and debug/monitor tools for Snowpark containers and GenAI agents/apps.

Datacoves Take: Observability is critical. Many of our customers build their own monitoring to manage costs and data issues. With these improvements, Snowflake is catching up—and Datacoves complements this with pipeline-level observability, including Airflow and dbt.

Read the full post from Snowflake here:
https://www.snowflake.com/en/blog/announcements-snowflake-summit-2025/

Hidden cost of no code ETL
5 mins read

The Hidden Costs of No-Code ETL Tools: 10 Reasons They Don’t Scale

"It looked so easy in the demo…"
— Every data team, six months after adopting a drag-and-drop ETL tool

If you lead a data team, you’ve probably seen the pitch: Slick visuals. Drag-and-drop pipelines. "No code required." Everything sounds great — and you can’t wait to start adding value with data!

At first, it does seem like the perfect solution: non-technical folks can build pipelines, onboarding is fast, and your team ships results quickly.

But our time in the data community has revealed the same pattern over and over: What feels easy and intuitive early on becomes rigid, brittle, and painfully complex later.

Let’s explore why no code ETL tools can lead to serious headaches for your data preparation efforts.

What Is ETL (and Why It Matters)?

Before jumping into the why and the how, let’s start with the what.

When data is created in its source systems, it is rarely ready to be used for analysis as is. It almost always needs to be reshaped and transformed before downstream teams can gather insights from it. That is where ETL comes in. ETL stands for Extract, Transform, Load: the process of moving data from multiple sources, reshaping (transforming) it, and loading it into a system where it can be used for analysis.

At its core, ETL is about data preparation:

  • Extracting raw data from different systems
  • Transforming it — cleaning, standardizing, joining, and applying business logic
  • Loading the refined data into a centralized destination like a data warehouse

Without ETL, you’re stuck with messy, fragmented, and unreliable data. Good ETL enables better decisions, faster insights, and more trustworthy reporting. Think of ETL as the foundation that makes dashboards, analytics, data science, machine learning, GenAI, and data-driven decision-making possible.
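
As a simple illustration of the “T” step (all table and column names here are hypothetical), the SQL below standardizes values, fixes data types, and joins two raw sources into a single analysis-ready table in the warehouse:

Example: A Basic Transform Step in SQL

CREATE TABLE analytics.orders_clean AS
SELECT
 o.order_id,
 c.customer_id,
 UPPER(TRIM(o.status)) AS order_status,                 -- standardize inconsistent status values
 CAST(o.order_total AS DECIMAL(12, 2)) AS order_total,  -- enforce a consistent numeric type
 CAST(o.created_at AS DATE) AS order_date
FROM raw.orders AS o
JOIN raw.customers AS c
 ON o.customer_email = c.email;                         -- join two raw sources into one table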

Now the real question is: how do we get from raw data to insights? That is where tooling comes into the picture. At a very high level, we group tools into two categories: code-based and no-code/low-code. Let’s look at each in a little more detail. 

What Are Code-Based ETL Tools?

Code-based ETL tools require analysts to write scripts or code to build and manage data pipelines. This is typically done with programming languages like SQL or Python, often alongside specialized frameworks, such as dbt, that are tailored for data workflows.

Instead of clicking through a UI, users define the extraction, transformation, and loading steps directly in code — giving them full control over how data moves, changes, and scales.

Common examples of code-based ETL tooling include dbt (data build tool), SQLMesh, Apache Airflow, and custom-built Python scripts designed to orchestrate complex workflows.

While code-based tools often come with a learning curve, they offer serious advantages:

  • Greater flexibility to handle complex business logic
  • Better scalability as data volumes and pipeline complexity grow
  • Stronger maintainability through practices like version control, testing, and modular development

Most importantly, code-based systems allow teams to treat pipelines like software, applying engineering best practices that make systems more reliable, auditable, and adaptable over time.
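
As a small sketch of what this looks like in practice (the model, source, and column names are illustrative), a transformation in a code-based tool like dbt is just a SQL file: it lives in Git, gets peer reviewed in a pull request, and can be tested and deployed through CI/CD like any other software change.

Example: A Version-Controlled dbt Staging Model (models/staging/stg_customers.sql)

-- Cleans raw CRM data into a consistent staging model.
-- {{ source(...) }} points at a raw table declared in a sources .yml file,
-- so dbt can trace lineage back to the ingested data.
SELECT
 customer_id,
 LOWER(TRIM(email)) AS email,
 UPPER(TRIM(country_code)) AS country_code,
 CAST(signup_date AS DATE) AS signup_date
FROM {{ source('crm', 'raw_customers') }}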

Building and maintaining robust ETL pipelines with code requires up-front work to set up CI/CD and developers who understand SQL or Python. Because of this investment in expertise, some teams are tempted to explore whether the grass is greener on the other side with no-code or low-code ETL tools that promise faster results with less engineering complexity. No hard-to-understand code, just drag and drop via nice-looking UIs. This is certainly less intimidating than seeing a SQL query.

What Are No-Code ETL Tools?

As you might have already guessed, no-code ETL tools let users build data pipelines without writing code. Instead, they offer visual interfaces—typically drag-and-drop—that “simplify” the process of designing data workflows.

These tools aim to make data preparation accessible to a broader audience by removing the need to write code. They create the impression that you don't need skilled engineers to build and maintain complex pipelines, allowing users to define transformations through menus, flowcharts, and configuration panels, with no technical background required.

However, this perceived simplicity is misleading. No-code platforms often lack essential software engineering practices such as version control, modularization, and comprehensive testing frameworks. This can lead to a buildup of technical debt, making systems harder to maintain and scale over time. As workflows become more complex, the initial ease of use can give way to a tangled web of dependencies and configurations, challenging to untangle without skilled engineering expertise. Additional staff is needed to maintain data quality, manage growing complexity, and prevent the platform from devolving into a disorganized state. Over time, team velocity decreases due to layers of configuration menus.

Popular no-code ETL tools include Matillion, Talend, Azure Data Factory (ADF), Informatica, and Alteryx. They promise minimal coding while supporting complex ETL operations. However, it's important to recognize that while these tools can accelerate initial development, they may introduce challenges in long-term maintenance and scalability.

To help explain why best-in-class organizations typically avoid no-code tools, we've come up with 10 reasons that highlight their limitations.

🔟 Reasons GUI-Based ETL Tools Don’t Scale

1. Version control is an afterthought

Most no-code tools claim Git support, but it's often limited to unreadable exports like JSON or XML. This makes collaboration clunky, audits painful, and coordinated development nearly impossible.

Bottom Line: Scaling a data team requires clean, auditable change management — not hidden files and guesswork.

2. Reusability is limited

Without true modular design, teams end up recreating the same logic across pipelines. Small changes become massive, tedious updates, introducing risk and wasting your data team’s time. $$$

Bottom Line: When your team duplicates effort, innovation slows down.

3. Debugging is frustrating

When something breaks, tracing the root cause is often confusing and slow. Error messages are vague, logs are buried, and troubleshooting feels like a scavenger hunt. Again, wasting your data team’s time.

Bottom Line: Operational complexity gets hidden behind a "simple" interface — until it’s too late and it starts costing you money.

4. Testing is nearly impossible

Most no-code tools make it difficult (or impossible) to automate testing. Without safeguards, small changes can ripple through your pipelines undetected. Users will notice it in their dashboards before your data teams have their morning coffee.

Bottom Line: If you can’t trust your pipelines, you can’t trust your dashboards or reports.

5. They eventually require code anyway

As requirements grow, "no-code" often becomes "some-code." But now you’re writing scripts inside a platform never designed for real software development. This leads to painful uphill battles to scale.

Bottom Line: You get the worst of both worlds: the pain of code, without the power of code.

6. Poor team collaboration

Drag-and-drop tools aren’t built for teamwork at scale. Versioning, branching, peer review, and deployment pipelines — the basics of team productivity — are often afterthoughts. This makes it difficult for your teams to onboard, develop, and collaborate. The result: less innovation, fewer insights, and a higher cost to deliver them.

Bottom Line: Without true team collaboration, scaling people becomes as hard as scaling data.

7. Vendor lock-in is real

Your data might be portable, but the business logic that transforms it often isn't. Migrating away from a no-code tool can mean rebuilding your entire data stack from scratch. Want to switch tooling for best-in-class tools as the data space changes? Good luck. 

Bottom Line: Short-term convenience can turn into long-term captivity.

8. Performance problems sneak up on you

When your data volume grows, you often discover that what worked for a few million rows collapses under real scale. Because the platform abstracts how the work is done, optimization is hard — and costly to fix later. Your data team will struggle to lower that bill far more than they would with finely tuned, code-based tools. 

Bottom Line: You can’t improve what you can’t control.

9. Developers don’t want to touch them

Great analysts prefer tools that allow precision, performance tuning, and innovation. If your environment frustrates them, you risk losing your most valuable technical talent. Onboarding new people is expensive; you want to keep and cultivate the talent you do have. 

Bottom Line: If your platform doesn’t attract builders, you’ll struggle to scale anything.

10. They trade long-term flexibility for short-term ease

No-code tools feel fast at the beginning. Setup is quick, results come quickly, and early wins are easy to showcase. But as complexity inevitably grows, you’ll face rigid workflows, limited customization, and painful workarounds. These tools are built for simplicity, not flexibility, and that becomes a real problem when your needs evolve. Simple tasks like moving a few fields or renaming columns stay easy, but once you need complex business logic, large transformations, or multi-step workflows, it is a different matter. What once sped up delivery now slows it down, as teams waste time fighting platform limitations instead of building what the business needs.

Bottom Line: Early speed means little if you can’t sustain it. Scaling demands flexibility, not shortcuts.

Conclusion

No-code ETL tools often promise quick wins: rapid deployment, intuitive interfaces, and minimal coding. While these features can be appealing, especially for immediate needs, they can introduce challenges at scale.

As data complexity grows, the limitations of no-code solutions—such as difficulties in version control, limited reusability, and challenges in debugging—can lead to increased operational costs and hindered team efficiency. These factors not only strain resources but can also impact the quality and reliability of your data insights. 

It's important to assess whether a no-code ETL tool aligns with your long-term data strategy. Always consider the trade-offs between immediate convenience and future scalability. Engaging with your data team to understand their needs and the potential implications of tool choices can provide valuable insights. 

What has been your experience with no-code ETL tools? Have they met your expectations, or have you encountered unforeseen challenges?

Get our free ebook dbt Cloud vs dbt Core
