Datacoves blog

Learn more about dbt Core, ELT processes, DataOps,
modern data stacks, and team alignment by exploring our blog.
dbt vs Airflow
5 mins read

dbt transforms data inside your warehouse using SQL. Airflow orchestrates workflows across your entire data pipeline. They are not competing tools -- they solve different problems. dbt handles the "T" in ELT: modeling, data testing, unit testing, documentation, and lineage. Airflow handles the scheduling, sequencing, retries, failure notifications, observability, and coordination of every step in the pipeline, including ingesting data, triggering dbt, and activation steps. Most production data teams use both. The real decision is not which tool to pick, but how to run them together without drowning in infrastructure overhead.

What is Apache Airflow?

Airflow is a workflow scheduler, but it is only one part of a broader data orchestration problem that includes dependencies, retries, visibility, and ownership across the entire data lifecycle.

How Does Airflow Work?

Imagine a scenario where you have a series of tasks: Task A, Task B, and Task C. These tasks need to be executed in sequence every day at a specific time. Airflow enables you to programmatically define the sequence of steps as well as what each step does. With Airflow you can also monitor the execution of each step and get alerts when something fails.
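That daily three-task scenario can be expressed as an Airflow DAG in a few lines of Python. This is a minimal sketch: the DAG id, schedule, and commands are placeholders, and the `schedule` argument assumes Airflow 2.4+ (older versions use `schedule_interval`).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_abc",
    schedule="0 6 * * *",            # every day at 6:00 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    task_a = BashOperator(task_id="task_a", bash_command="echo 'A'")
    task_b = BashOperator(task_id="task_b", bash_command="echo 'B'")
    task_c = BashOperator(task_id="task_c", bash_command="echo 'C'")

    # Run strictly in sequence; Airflow handles retries and failure alerts
    task_a >> task_b >> task_c
```

The `>>` operator declares the dependency chain, which is what the Airflow UI renders as the DAG graph.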

Sample job with 3 tasks

What Sets Airflow Apart?

Airflow's core strength is flexibility. You define the code behind each task, not just the sequence and schedule. Whether it's triggering an extraction job, calling a Python script, or integrating with an external platform, Airflow lets you control exactly how each step executes.

But that flexibility is a double-edged sword. Just because you can code everything inside Airflow doesn't mean you should. Overly complex task logic inside Airflow makes workflows harder to debug and maintain. The best practice: use Airflow as the orchestrator, not the execution engine. Let specialized tools handle the heavy lifting. Use tools like dbt for transformation and tools like Azure Data Factory, dlt, Fivetran or Airbyte for extraction. Let Airflow handle the scheduling, coordination, observability, and failure notification.

Best Use Cases for Airflow

  • Data Extraction and Loading: Trigger external tools like Azure Data Factory, dlt, Fivetran, or Airbyte for data ingestion. When no off-the-shelf tool exists, Airflow can call custom Python scripts or frameworks like dlt to handle the load.
  • Data Transformation Coordination: Trigger dbt Core to run transformation steps in the right sequence with appropriate parallelism, directly in the data warehouse.
  • ML Workflow Coordination: Orchestrate machine learning pipelines, from data preprocessing to model refreshes, ensuring each step runs after upstream transformations complete.
  • Automated Reporting: Trigger report generation or data exports to BI and reporting tools on a schedule.
  • System Maintenance and Monitoring: Schedule backups, manage log storage, and send alerts on system anomalies.
Airflow DAG with EL, T, and Activation

Airflow Benefits

  • Job Management: Define workflows and task dependencies with a built-in scheduler that handles sequencing and synchronization.
  • Retry Mechanism: Automatically retry failed tasks or entire pipelines with configurable retry logic, ensuring resilience and fault tolerance.
  • Alerting: Send failure notifications to email, Slack, MS Teams, or other tools so your team knows immediately when something breaks.
  • Monitoring: Track workflow status, task durations, and historical runs through the Airflow UI.
  • Scalability: Deploy on Kubernetes to scale up during peak processing and scale down when idle, keeping infrastructure costs aligned with actual usage.
  • Community: Active open-source project with robust support on GitHub and Slack, plus a deep library of learning resources.

Airflow Challenges

  • Python Required: Airflow is Python-based, so creating workflows requires Python proficiency.
  • Learning Curve: Concepts like DAGs, operators, hooks, and task dependencies take time to master, especially for team members without prior orchestration experience.
  • Production Deployment: Setting up a production-grade Airflow environment requires knowledge of Kubernetes, networking, and infrastructure tuning.
  • Ongoing Maintenance: Upgrading Airflow and its underlying infrastructure carries real cost and complexity, particularly for teams that don't manage distributed systems regularly.
  • Debugging: Identifying root causes in a complex Airflow environment requires experience. Knowing where to start is half the battle.
  • Cost of Ownership: Airflow is open-source and free to use, but the total cost of deployment, tuning, and ongoing support adds up quickly.
  • Development Experience: Testing DAGs locally requires setting up Docker-based Airflow environments, which adds friction and slows iteration cycles.

What is dbt?

dbt (data build tool) is an open-source framework that uses SQL and Jinja templating to transform data inside the warehouse. Developed by dbt Labs, dbt handles the "T" in ELT: transforming, testing, documenting, and tracking lineage of data models. It brings software engineering practices like version control, CI/CD, and modular code to analytics workflows.

How Does dbt Work?

dbt lets you write transformation logic as SQL select statements, enhanced with Jinja templating for dynamic execution. Each transformation is called a "model." When you run dbt, it compiles those models into raw SQL and executes them against your warehouse, creating or replacing tables and views. dbt resolves dependencies between models automatically, running them sequentially or in parallel as needed.
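For example, a model is just a select statement in a .sql file. The model and column names here are hypothetical:

```sql
-- models/customer_orders.sql (hypothetical model)
-- {{ ref('stg_orders') }} tells dbt that this model depends on stg_orders;
-- dbt compiles it to the fully qualified table name at run time and uses
-- it to build the dependency graph.
select
    customer_id,
    count(*)    as order_count,
    sum(amount) as lifetime_value
from {{ ref('stg_orders') }}
group by customer_id
```

Running `dbt run` materializes this model as a table or view, after any models it references have been built.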

What Sets dbt Apart?

Unlike traditional ETL tools that abstract SQL behind drag-and-drop interfaces, dbt treats SQL as the primary language for transformation. This makes it immediately accessible to anyone who already knows SQL. But dbt goes further by adding Jinja templating for dynamic scripting, conditional logic, and reusable macros.

A few things set dbt apart from alternatives:

  • Idempotency: dbt transformations produce the same result every time they run, making pipelines predictable and repeatable.
  • Testing built in: Define data quality tests alongside your models. No need for a separate testing tool.
  • Auto-generated documentation: dbt produces a web-based docs site with descriptions, lineage graphs, and metadata -- all generated directly from your project.
  • Open metadata: dbt outputs metadata files (artifacts) that downstream tools like data catalogs and observability platforms can consume. This format is quickly becoming the standard for sharing information across data platforms.

dbt excels at the "T" in ELT but requires complementary tools for extraction, loading, and orchestration.

Best Use Cases for dbt

  • Data Transformation for Analytics: Transform and aggregate raw data into models ready for BI platforms, analytics tools, and ML pipelines.
  • SQL-based Workflow Creation: Build modular transformation workflows using SQL and Jinja, with reusable macros and dynamic script generation.
  • Data Validation and Testing: Define schema tests and custom tests alongside your models to catch data quality issues before they reach end users.
  • SQL Unit Testing: Verify that a specific transformation works as expected by providing sample input data and comparing against expected output.
  • Documentation and Lineage: Auto-generate a documentation site with descriptions, column-level lineage, and dependency graphs to simplify impact analysis and debugging.
  • Version Control and DataOps: Manage transformations through Git-based workflows with branching, pull requests, and environment-specific deployments.

dbt Benefits

  • SQL-based: dbt uses SQL and Jinja for transformation, so most data analysts and engineers can contribute without learning a new language.
  • Modularity: Each transformation is a SQL select statement that builds on other models. This modular approach simplifies debugging, testing, and reuse.
  • Testing: Built-in testing framework lets you validate data quality at every layer without introducing additional tools.
  • Documentation: Auto-generated docs site exposes descriptions, lineage, and metadata, making it easier to maintain projects over time and helping data consumers find and understand data before use.
  • Lineage and Impact Analysis: Visual dependency graphs simplify debugging when issues occur and help you trace which sources feed a specific data product like an executive dashboard.
  • Open-source: dbt Core is open-source with no vendor lock-in.
  • Packages and Libraries: A large community maintains reusable dbt packages and Python libraries that extend dbt's functionality out of the box.
  • Open Metadata: dbt produces artifact files that data catalogs and observability tools can consume, making it a de facto standard for sharing metadata across platforms.
  • AI-Ready Foundation: dbt's structured metadata, lineage, and documentation make your warehouse AI-ready. When dbt artifacts are available in your warehouse (e.g. Snowflake), AI tools can use that context to understand table relationships, column descriptions, and data quality rules, turning your transformation layer into a knowledge layer that LLMs and copilots can reason over. See Snowflake Cortex CLI running in Datacoves.
  • Community: 100k+ members on Slack, extensive learning resources, and a growing ecosystem of practitioners solving similar data problems.
dbt Lineage Graph

dbt Challenges

  • Transformation only: dbt handles the "T" in ELT. Extraction, loading, orchestration, and downstream activation all require separate tools.
  • Macros complexity: dbt macros are powerful but can be hard to read, debug, and test, especially for analysts who are primarily SQL users.
  • Infrastructure management: Running dbt Core in production requires setting up CI/CD, environment management, secrets handling, and Git workflows. This overhead grows as the team and project scale.
  • No built-in orchestration: dbt Core has no scheduler. You need an external orchestrator like Airflow to automate runs and coordinate dbt with the rest of your pipeline.
  • Multi-project complexity: As organizations grow into multiple dbt projects, managing cross-project dependencies, shared models, and consistent conventions becomes a significant challenge.

Common Misconception: dbt Core vs. dbt Cloud

A common misunderstanding in the data community is equating "dbt" with "dbt Cloud." When people say dbt, they typically mean dbt Core -- the open-source framework. dbt Cloud is a commercial product from dbt Labs built on top of dbt Core that adds a scheduler, hosted IDE, monitoring, and managed infrastructure.

The key distinction: you can use dbt Core without paying for dbt Cloud, but you won't get a scheduler, hosted environment, or managed CI/CD. That means you'll need to solve orchestration, deployment, and infrastructure yourself. This is typically done with a tool like Airflow and a platform to tie it all together.

For a deeper comparison, see our article on dbt Cloud vs. dbt Core.

dbt Cloud's Scheduler: What It Does and Doesn't Cover

dbt Cloud includes a scheduler that automates dbt runs on a defined cadence. This keeps your transformation models fresh and your data reliable. But it only schedules dbt jobs, nothing else.

Your Extract and Load steps, downstream processes like BI refreshes, ML model retraining, marketing automation, and any cross-system coordination still need an orchestrator. dbt Cloud's scheduler does not replace Airflow. It replaces the need to cron-schedule dbt runs manually, but leaves the rest of the pipeline unmanaged.

Do I need Airflow if I use dbt Cloud?

Yes, unless your entire pipeline is just dbt transformations with nothing before or after them.

dbt Cloud schedules and runs your dbt jobs. That covers transformation, testing, and documentation. But most production pipelines involve more than that: extracting data from source systems, loading it into the warehouse, refreshing BI dashboards, triggering ML model retraining, sending alerts, and coordinating dependencies across all of these steps.

dbt Cloud does not orchestrate any of those. If your pipeline looks like this:

Extract → Load → Transform → Activate

Then dbt Cloud only covers one step. You still need an orchestrator like Airflow to tie the full pipeline together, manage dependencies between stages, handle retries, and provide end-to-end visibility when something breaks.

What dbt Cloud's scheduler actually covers

For small teams with a simple pipeline (one source, one warehouse, one dashboard), dbt Cloud's scheduler may be enough to start. But as soon as you add sources, downstream consumers, or cross-system dependencies, you'll need Airflow or a platform that includes it.

Managed dbt Core + Managed Airflow: Removing the Infrastructure Burden

The hidden cost of choosing open-source dbt Core and Airflow is the platform you have to build around them. CI/CD pipelines, secrets management, environment provisioning, Git workflows, Kubernetes tuning, Airflow upgrades -- this is real engineering work that pulls your team away from delivering data products.

Running dbt + Airflow: Build it yourself vs Managed platform

Datacoves eliminates that overhead. It provides managed dbt Core and managed Airflow in a unified environment deployed in your private cloud. Developers get an in-browser VS Code editor with dbt, Python, extensions, and Git pre-configured. Orchestration is handled through Datacoves' managed Airflow, which includes a simplified YAML-based DAG configuration that connects Extract, Load, and Transform steps without writing boilerplate Python.

Best practices, governance guardrails, and accelerators are built in, so teams get a production-grade data platform in minutes instead of months. For a deeper look at the tradeoffs, see our guide on building vs. buying your analytics platform.

To learn more about Datacoves, check out our product page.

Airflow DAG in YAML format

dbt vs Airflow: Which One Should You Choose?

This is not an either-or decision. dbt and Airflow solve different problems, and trying to force one tool to do the other's job creates fragile, hard-to-maintain pipelines.

Use dbt for transformation, testing, documentation, and lineage. Use Airflow for scheduling, dependency management, retries, and end-to-end pipeline coordination. The comparison table below shows where each tool fits.

Which Tool Should You Adopt First?

If you need to pick one tool to start with, here's the decision framework:

Start with dbt if your immediate priority is transforming raw data into reliable, tested models in the warehouse. dbt gives you version control, testing, documentation, and lineage from day one. For small teams with simple pipelines, a cron job or dbt Cloud's scheduler can handle automation in the short term.

Start with Airflow if you already have multiple systems that need coordination: extraction from SaaS sources, loading into the warehouse, transformation, and downstream activation. Airflow gives you end-to-end visibility and control across the full pipeline.

The reality: most teams outgrow a single-tool approach quickly. Once your pipeline spans more than one step, you need both. The question isn't which tool to pick, it's how soon you'll need the other one. If you're still evaluating transformation tools, check out our dbt alternatives comparison.

How dbt and Airflow Work Together

In a production pipeline, the tools fit together like this:

  1. Airflow triggers extraction and loading, kicking off tools like dlt, Airbyte, Fivetran, or Azure Data Factory to pull data from source systems into the warehouse.
  2. Airflow triggers dbt. Once data lands in the warehouse, Airflow calls dbt to run transformations and tests in the correct sequence.
  3. Airflow handles downstream activation. After transformation, Airflow pushes data to BI platforms, refreshes ML models, triggers marketing automation, or sends alerts.
Who owns what: Airflow vs dbt in the data pipeline
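The three steps above can be sketched as a single Airflow DAG. This is a minimal illustration; the task names, schedule, scripts, and project path are hypothetical, and the `schedule` argument assumes Airflow 2.4+.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="elt_pipeline",
    schedule="0 5 * * *",            # daily at 5:00 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # 1. Extract and load: trigger dlt, Airbyte, Fivetran, etc.
    extract_load = BashOperator(
        task_id="extract_load",
        bash_command="python load_sources.py",
    )
    # 2. Transform: dbt runs models and tests in dependency order
    transform = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/analytics",
    )
    # 3. Activate: refresh dashboards, retrain models, send alerts
    activate = BashOperator(
        task_id="activate",
        bash_command="python refresh_dashboards.py",
    )
    extract_load >> transform >> activate
```

Note the division of labor: Airflow only sequences the steps and handles retries; the actual transformation logic lives entirely in the dbt project.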

Airflow owns the "when" and "what order." dbt owns the "how" for transformation. Neither tool steps on the other's responsibilities.

The challenge is running them together. Deploying Airflow and dbt Core in a shared, production-grade environment requires Kubernetes, CI/CD, secrets management, environment provisioning, and ongoing maintenance. For teams that don't want to build and maintain that platform layer themselves, a managed solution like Datacoves provides both tools pre-integrated in your private cloud -- so your team focuses on data, not infrastructure.

Conclusion: dbt and Airflow Are Better Together

dbt and Airflow are not competing tools. dbt transforms your data. Airflow orchestrates your pipeline. Trying to use one without the other leaves gaps that show up as fragile pipelines, manual workarounds, and constant firefighting. The teams that struggle most are the ones who pick the right tools but never invest in the platform layer that makes them work together.

The real question is not "dbt vs Airflow," it's how you run them together without building and maintaining a platform from scratch. That infrastructure layer (Kubernetes, CI/CD, secrets, environments, Git workflows, upgrades) is where most teams burn time and budget.

Datacoves gives you managed dbt Core and managed Airflow in a unified environment, deployed in your private cloud, with best practices and governance built in. Your team focuses on delivering data products, not maintaining the platform underneath them.

See how Datacoves works.

dbt testing options
5 mins read

dbt testing gives data teams a way to validate data quality, transformation logic, and governance standards directly inside their development workflow. Out of the box, dbt Core includes generic tests (unique, not_null, accepted_values, relationships), singular tests for custom SQL assertions, and unit tests for verifying transformation logic before it reaches production. Community packages like dbt-utils and dbt-expectations add dozens more. Combined with CI/CD tools like dbt-checkpoint, teams can enforce testing and documentation standards on every pull request.

This guide covers every layer: what ships with dbt Core, which packages to add, how to store and review results, and how to integrate testing into your CI/CD pipeline.

dbt Testing in dbt Core: Built-in Test Types Explained

dbt has two main categories of tests: data tests and unit tests.

Data tests validate the integrity of actual warehouse data and run with every pipeline execution. They come in two forms: generic tests, defined in YAML and applied across models, and singular tests, written as standalone SQL assertions for specific conditions. Under the hood, dbt compiles each data test to a SQL SELECT statement and executes it against your database. If any rows are returned, dbt marks the test as failed.

Unit tests, introduced in dbt 1.8, validate transformation logic using static, predefined inputs. Unlike data tests, they are designed to run during development and CI only, not in production.

Categories of dbt tests
dbt Data and Unit Tests

dbt Test Types Quick Reference

Generic Tests in dbt Core

dbt Core ships with four built-in generic tests that cover the most common data quality checks.

  • unique verifies that a column contains no duplicate values. Use this on identifiers like customer_id or order_id to catch accidental duplication in your models.
  • not_null checks that a column contains no null values. This is especially useful for catching silent upstream failures where a field stops being populated.
  • accepted_values validates that a column only contains values from a defined list. For example, a payment_status column might only allow pending, failed, accepted, or rejected. This test will catch it if a new value like approved appears unexpectedly.
  • relationships checks referential integrity between two tables. If an orders table references customer_id, this test verifies that every customer_id in orders has a matching record in the customers table.

You apply generic tests by adding them to the model's property YAML file.

Validate information about each customer in the customer details table.
dbt Core Generic Tests
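A property file combining all four built-in tests might look like this (model, column, and value names are illustrative; the `data_tests:` key is dbt 1.8+, older versions use `tests:`):

```yaml
# models/marts/_orders.yml (hypothetical file)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        data_tests:
          - unique
          - not_null
      - name: payment_status
        data_tests:
          - accepted_values:
              values: ['pending', 'failed', 'accepted', 'rejected']
      - name: customer_id
        data_tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```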

Generic tests also support additional configurations that give you more control over how and when they fail.

  • where limits the test to a subset of rows, useful on large tables where you want to test only recent data or exclude specific values.
  • severity controls whether a test failure blocks execution. Set to warn to flag issues without stopping the pipeline, or keep the default error for critical checks.
  • name overrides the auto-generated test name. Since dbt generates long default names, a custom name makes logs and audit tables much easier to read.
Same property file with the additional configurations defined

dbt tests with where condition, severity, and name defined
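Sketched in YAML, those three configurations applied to a single test might look like this (column name and date filter are hypothetical):

```yaml
      - name: order_id
        data_tests:
          - not_null:
              name: not_null_recent_order_id   # readable name for logs
              config:
                where: "order_date >= '2024-01-01'"  # test recent rows only
                severity: warn                       # flag, don't block
```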

Singular Tests in dbt Core

Singular tests are custom SQL assertions saved in your tests/ directory. Each file contains a query that returns the rows failing the test. If the query returns any rows, dbt marks the test as failed.

Use singular tests when built-in generic tests or package tests do not cover your specific business logic. A good example: verifying that sales for one product stay within +/- 10% of another product over a given period. That kind of assertion is too specific for a reusable generic test but straightforward to express in SQL.

When you find yourself writing similar singular tests repeatedly across models, that is a signal to convert the logic into a custom generic test instead.

dbt Core singular tests
dbt singular data test example
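A sketch of that product-comparison assertion as a singular test (file, model, and column names are hypothetical; the query returns the weeks that violate the +/- 10% rule, so any returned row fails the test):

```sql
-- tests/assert_product_a_tracks_product_b.sql (hypothetical)
with weekly as (
    select
        date_trunc('week', order_date) as sales_week,
        sum(case when product_id = 'A' then amount else 0 end) as a_sales,
        sum(case when product_id = 'B' then amount else 0 end) as b_sales
    from {{ ref('fct_sales') }}
    group by 1
)

-- Failing rows: weeks where A falls outside 90%-110% of B
select *
from weekly
where a_sales not between b_sales * 0.9 and b_sales * 1.1
```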

Custom Generic Tests in dbt Core

Custom generic tests work like dbt macros. They can be stored in tests/generic/ or in your macros/ directory. Datacoves recommends tests/generic/ as the default location, but macros/ makes more sense when the test depends on complex macro logic. At minimum, a custom generic test accepts a model parameter, with an optional column_name if the test applies to a specific column. Additional parameters can be passed to make the test more flexible.

Once defined, a custom generic test can be applied across any model or column in your project, just like the built-in tests. This makes them the right choice when you have business logic that repeats across multiple models but is too specific to find in a community package.

dbt Core custom generic tests
Custom generic data test definition
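A minimal custom generic test looks like a macro wrapped in a test block. This hypothetical example checks that a column only holds positive values:

```sql
-- tests/generic/test_positive_values.sql (hypothetical)
{% test positive_values(model, column_name) %}

-- Failing rows: any value at or below zero
select *
from {{ model }}
where {{ column_name }} <= 0

{% endtest %}
```

Once defined, it is applied in YAML like any built-in test, e.g. `- positive_values` under a column's `data_tests:` list.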

Unit Tests in dbt Core (dbt 1.8+)

Unit tests, available natively since dbt 1.8, validate that your SQL transformation logic produces the expected output before data reaches production.

Unlike data tests that run against live warehouse data on every pipeline execution, unit tests use static, predefined inputs defined as inline values, seeds, or SQL queries in your YAML config. Because the expected outputs don’t change between runs, there’s no value in running unit tests in production. Run them locally during development and in your CI pipeline when new transformation code is introduced.
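A unit test definition supplies mock input rows and the expected output rows in YAML. This sketch assumes a hypothetical fct_orders model that computes net_amount from amount and discount_pct:

```yaml
unit_tests:
  - name: test_discount_applied            # hypothetical model and columns
    model: fct_orders
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, amount: 100, discount_pct: 10}
          - {order_id: 2, amount: 50, discount_pct: 0}
    expect:
      rows:
        - {order_id: 1, net_amount: 90}
        - {order_id: 2, net_amount: 50}
```

`dbt test --select fct_orders` runs the model's SQL against the mocked input and compares the result to the `expect` rows.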

If your project is on a version older than dbt 1.8, upgrading is the recommended path. Community packages that previously filled this gap (dbt-unit-testing, dbt_datamocktool, dbt-unittest) are no longer actively maintained and are not recommended for new projects.

Source Freshness Checks in dbt Core

Source freshness checks aren’t technically dbt tests, but they solve one of the most common silent failure modes in data pipelines: a source stops updating, but the pipeline keeps running without any errors.

Freshness checks are configured in your sources.yml file with warn_after and error_after thresholds. When dbt detects that a source has not been updated within the defined window, it raises a warning or error before your models run. This is especially critical for time-sensitive reporting, where stale data can be worse than no data at all.

dbt Core freshness check
Source Freshness Check Configuration
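A freshness configuration might look like this (source, table, and threshold values are illustrative; `loaded_at_field` must point at a timestamp column that records when each row landed):

```yaml
# models/sources.yml
version: 2

sources:
  - name: crm                     # hypothetical source system
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: customers
```

`dbt source freshness` compares the max of `_loaded_at` against these thresholds before your models run.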

dbt Testing Packages: Extending Beyond dbt Core

dbt Core's built-in tests cover the fundamentals, but the community has built a rich ecosystem of packages that extend testing well beyond what ships out of the box. Packages are installed via your packages.yml file and sourced from dbt Hub.
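Installation is a two-step process: declare the packages, then run `dbt deps`. The version pins below are illustrative; check dbt Hub for current releases.

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0        # illustrative pin
  - package: metaplane/dbt_expectations
    version: 0.10.4       # illustrative pin
```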

dbt-utils: 16 Additional Generic Tests

dbt-utils, maintained by dbt Labs, is the most widely used dbt testing package. It adds 16 generic tests alongside SQL generators and helper macros that complement dbt Core's built-in capabilities.

  • not_accepted_values is the inverse of accepted_values. Use it to assert that specific values are never present in a column.
  • equal_rowcount confirms that two tables have the same number of rows. This is particularly useful after transformation steps where row counts should be preserved.
  • fewer_rows_than validates that a target table has fewer rows than a source table, which is the expected result after any aggregation step.
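In a property file, package tests are namespaced with the package name. A sketch using two of the tests above (model names hypothetical):

```yaml
models:
  - name: fct_orders
    data_tests:
      # Model-level test: row count must match the staging model
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_orders')
    columns:
      - name: status
        data_tests:
          # Column-level test: these values must never appear
          - dbt_utils.not_accepted_values:
              values: ['deleted', 'unknown']
```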

For the full list of all 16 generic tests with usage examples, see the Datacoves dbt-utils cheatsheet.

dbt-expectations: 62 Tests Modeled After Great Expectations

dbt-expectations, maintained by Metaplane, ports the Python library Great Expectations into the dbt ecosystem. It gives analytics engineers 62 reusable generic tests without adding a separate tool to the stack.
The package covers seven categories:

  • Table shape (15 tests)
  • Missing values, unique values, and types (6 tests)
  • Sets and ranges (5 tests)
  • String matching (10 tests)
  • Aggregate functions (17 tests)
  • Multi-column (6 tests)
  • Distributional functions (3 tests)

The string matching and aggregate function categories in particular cover validations that would otherwise require custom singular tests. Full documentation is available on the dbt-expectations GitHub repository.

dbt_constraints: Enforce Primary and Foreign Keys

dbt_constraints, created by Snowflake, generates database-level primary key, unique key, and foreign key constraints directly from your existing dbt tests. It is primarily designed for Snowflake, with limited support on other platforms.

When added to a project, it automatically creates three types of constraints.

  • A primary_key is a unique, not-null constraint on one or more columns.
  • A unique_key is a uniqueness constraint, including support for composite keys via dbt_utils.unique_combination_of_columns.
  • A foreign_key is a referential integrity constraint derived from existing relationships tests.

This package is most valuable for teams that want database constraints alongside dbt's test-based validation, not instead of it. Snowflake doesn’t enforce most constraints at write time, but the query optimizer uses them for better execution plans. Some tools also read constraints to reverse-engineer data model diagrams or add joins automatically between tables.

For most dbt projects, dbt-utils and dbt-expectations are the two packages worth adding first. Layer in dbt_constraints when your use case specifically calls for database-level constraints on Snowflake.

Storing and Reviewing dbt Test Results

By default, dbt doesn’t persist test results between runs. You can change this by setting store_failures: true in your dbt_project.yml or at the individual test level.

This is useful for spot-checking failures after a run, but each execution overwrites the previous results, so it does not give you historical visibility into data quality trends over time. The tools below address that gap with lightweight reporting options.

Using store_failures to Capture Failing Records

When store_failures: true is enabled at the project or test level, dbt writes the rows that fail each test into tables in your warehouse under a schema named [your_schema]_dbt_test__audit. This creates one table per test, containing the actual records that triggered the failure.

This is a practical first step for teams that want to inspect failures without setting up a dedicated observability platform. The main limitation: each run overwrites the previous results, so you cannot track failure trends over time without additional tooling.

Datacoves recommends storing dbt audit tables in a separate database to keep your production environment clean. On platforms like Snowflake, this also reduces overhead for operations like database cloning.
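In dbt_project.yml, that setup might be sketched like this (the audit database name is hypothetical, and the `data_tests:` config block is `tests:` on dbt versions before 1.8):

```yaml
# dbt_project.yml
data_tests:
  my_project:
    +store_failures: true
    +schema: dbt_test__audit   # suffix appended to your target schema
    +database: quality_audit   # hypothetical separate audit database
```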

dq-tools: Visualize dbt Test Results in Your BI Dashboard

dq-tools stores dbt test results and makes them available for visualization in a BI dashboard. If your team already has a BI layer in place, this is a lightweight way to surface data quality metrics alongside existing reporting without adding a separate observability tool.

datacompy: Table-Level Validation for Data Migration

datacompy is a Python library for migration validation. It performs detailed table comparisons, reporting on row-level differences, column mismatches, and statistical summaries. It’s not a dbt package, but it integrates naturally into migration workflows that use dbt for transformation.

Standard dbt tests aren’t designed for row-by-row comparison across systems, which makes datacompy a useful complement when you’re migrating from legacy platforms.

dbt Testing During Development and CI/CD

The tests covered so far validate data. The tools below validate your dbt project itself, catching governance and compliance issues at the development and PR stage rather than in production.

dbt-checkpoint: Pre-Commit Hooks for dbt Core Teams

dbt-checkpoint, maintained by Datacoves, enforces project governance standards automatically so teams don’t rely on manual code review to catch missing documentation, unnamed columns, or hardcoded table references. It runs as a pre-commit hook, blocking non-compliant code before it’s pushed to the main branch, and can also run in CI/CD pipelines to enforce the same checks on every pull request.

Out of the box it validates things like whether models and columns have descriptions, whether all columns defined in SQL are present in the corresponding YAML property file, and whether required tags or metadata are in place.

It's a natural fit for dbt Core teams that want Git-native governance without additional infrastructure. The tradeoff: it requires familiarity with pre-commit configuration to set up and extend. Like all Python-based governance tools, it doesn't run inside the dbt Cloud IDE.
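A starter configuration enables a handful of hooks in .pre-commit-config.yaml. The `rev` pin is illustrative; the hook ids shown are real dbt-checkpoint checks:

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v2.0.6    # illustrative; pin to the latest release
    hooks:
      - id: check-model-has-description
      - id: check-model-has-tests
      - id: check-script-has-no-table-name   # enforce ref()/source() usage
```

After `pre-commit install`, these checks run on every commit and block non-compliant models before they reach review.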

dbt-bouncer: Artifact-Based Convention Enforcement

dbt-bouncer, maintained by Xebia, takes an artifact-based approach, validating against manifest.json, catalog.json, and run_results.json rather than running as a pre-commit hook. It requires no direct database connection and can run in any CI/CD pipeline that has access to dbt artifacts. Checks cover naming patterns, directory structure, and description coverage, and it can run as a GitHub Action or standalone Python executable.

dbt-project-evaluator: DAG and Structure Linting from dbt Labs

dbt-project-evaluator, maintained by dbt Labs, takes a different approach. Rather than running as an external tool, it is a dbt package: it materializes your project's DAG structure into your warehouse and runs dbt tests against it.

It checks for DAG issues (model fanout, direct joins to sources, unused sources), testing and documentation coverage, naming convention violations, and performance problems like models that should be materialized differently.

The main limitation is adapter support: it works on BigQuery, Databricks, PostgreSQL, Redshift, and Snowflake, but not on Fabric or Synapse. Because it materializes models into your warehouse, it has a slightly higher execution cost than artifact-based tools.

dbt-score: Metadata Linting and Model Scoring

dbt-score, maintained by Picnic Technologies, is a Python-based CLI linter that reads your manifest.json and assigns each model a score from 0 to 10 based on metadata quality: missing descriptions, absent owners, undocumented columns, models without tests.

The scoring approach makes it easy to track improvement over time and prioritize which models need attention, without enforcing hard pass/fail gates on every PR. Custom rules are fully supported. It requires no database connection and no dbt run, just a manifest.json.

How to Choose a dbt CI/CD Governance Tool

These tools aren’t mutually exclusive, and many teams combine them. For dbt Core teams, dbt-checkpoint is the recommended starting point: it enforces governance at the PR stage with minimal setup and can run both locally and in CI/CD pipelines. dbt-bouncer is worth evaluating if your team wants artifact-based validation running in an external CI/CD pipeline. Add dbt-project-evaluator when you want DAG-level structural checks alongside documentation and testing coverage. Layer in dbt-score for ongoing visibility into metadata quality across your project without hard enforcement gates.

Where to Start With dbt Testing

A Practical Progression from First Tests to Full Governance

dbt testing isn’t an all-or-nothing investment. The most effective approach is to start with what ships in dbt Core and expand coverage as your team builds confidence.

A practical progression looks like this: start with unique, not_null, and relationships tests on source tables and mart models. Add dbt-utils and dbt-expectations as your testing needs grow beyond the basics. When your team is ready to enforce governance and metadata standards, dbt-checkpoint is the simplest starting point for dbt Core teams, with dbt-project-evaluator and dbt-score as complementary layers.

If your organization is still running transformations in legacy tools like Talend, Informatica, or Python, you do not need to wait for a full migration to start benefiting from dbt testing. dbt sources can point to any table in your warehouse, whether or not dbt created it, so you can layer dbt tests and documentation on top of your existing pipeline today. See Using dbt to Document and Test Data Transformed with Other Tools for a step-by-step walkthrough.

Where teams hit friction isn’t usually the tests themselves. It’s managing the infrastructure, environments, CI/CD pipelines, and orchestration around them. Datacoves handles that layer, giving your team a managed dbt and Airflow environment so they can focus on building and testing models rather than maintaining tooling. Learn more about how Datacoves compares to other dbt solutions.

Dbt cheat sheet
5 mins read

You now know what dbt (data build tool) is all about. You are being productive, but you forget what `dbt build` does or what the `@` dbt graph operator does. This handy dbt cheat sheet has it all in one place.

dbt cheat sheet - Updated for dbt 1.8

Originally created by Bruno de Lima, this awesome dbt cheat sheet was first refreshed for dbt 1.6 and has now been updated again for dbt 1.8.

We have also moved the dbt Jinja cheat sheet to a dedicated post.

This reference summarizes all the dbt commands you may need as you run your dbt jobs or study for your dbt certification.

If you ever wanted to know the difference between +model and @model in your dbt run, you will find the answer here. Whether you are trying to understand dbt graph operators or what the `dbt retry` command does, this cheat sheet has you covered. Check it out below.

Primary dbt commands

These are the principal commands you will use most frequently with dbt. Not all of these are available in dbt Cloud.

dbt Command arguments

The dbt commands above have options that allow you to select and exclude models, as well as defer to another environment, like production, instead of building dependent models for a given run. This table shows which options are available for each dbt command.

dbt selectors

By combining the arguments above, like "-s", with the options below, you can tell dbt which items you want to select or exclude. This can be a specific dbt model, everything in a specific folder, or, with recent versions of dbt, a specific version of a model.

dbt node selectors
tag: Select models that match a specified tag.
source: Select models that select from a specified source.
path: Select models/sources defined at or under a specific path.
file / fqn: Select a model by its filename, including the file extension (.sql).
package: Select models defined within the root project or an installed dbt package.
config: Select models that match a specified node config.
test_type: Select tests based on their type: singular, generic, or unit (unit tests are available as of dbt 1.8).
test_name: Select tests based on the name of the generic test that defines them.
state: Select nodes by comparing them against a previous version of the same project, represented by a manifest. The file path of the comparison manifest must be specified via the --state flag or the DBT_STATE environment variable.
exposure: Select parent resources of a specified exposure.
metric: Select parent resources of a specified metric.
result: Related to the state method above; selects resources based on their result status from a prior run.
source_status: Select resources based on their source freshness status.
group: Select models defined within a group.
access: Select models based on their access property.
version: Select versioned models based on their version identifier and latest version.
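Combining the -s argument with these methods, a few illustrative invocations (the tag, path, and state directory are hypothetical):

```shell
# names and paths are placeholders for your own project
dbt ls -s tag:nightly                           # everything tagged "nightly"
dbt run -s path:models/staging                  # all models under models/staging
dbt build -s config.materialized:incremental    # all incremental models
dbt build -s state:modified+ --state ./prod-artifacts   # changed nodes and their children
```

The state example is what powers "slim CI" setups: only the nodes that differ from the production manifest (and their descendants) are rebuilt.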

dbt graph operators

dbt graph operators provide a powerful syntax that allows you to home in on the specific items you want dbt to process.

dbt graph operators
+ If the "plus" (+) operator is placed at the front of the model selector, it selects the model and all of its parents. If placed at the end, it selects the model and all of its children.
n+ With the n-plus (n+) operator you can adjust the behavior of the + operator by quantifying the number of edges to step through.
@ The "at" (@) operator is similar to +, but it also includes the parents of the children of the selected model.
* The "star" (*) operator matches all models within a package or directory.
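As a sketch, here is how these operators look on the command line (my_model and the staging directory are placeholders):

```shell
# "my_model" and "staging" are placeholder names
dbt run -s +my_model    # my_model and all of its parents
dbt run -s my_model+    # my_model and all of its children
dbt run -s 2+my_model   # my_model plus parents up to 2 edges away
dbt run -s @my_model    # my_model, its children, and the parents of those children
dbt run -s staging.*    # every model in the staging directory/package
```

The @ operator is particularly useful in CI: it rebuilds everything a downstream model needs, even ancestors that are not ancestors of the selected model itself.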

Project level dbt commands

The following commands are used less frequently and perform actions like initializing a dbt project, installing dependencies, or validating that you can connect to your database.

dbt command line (CLI) flags

The flags below immediately follow the dbt command and go before the subcommand, e.g. dbt <FLAG> run

Read the official dbt documentation

dbt command line (CLI) flags (general)
-x, --fail-fast / --no-fail-fast Stop dbt execution as soon as a failure occurs.
-h, --help Shows command help documentation
--send-anonymous-usage-stats / --no-send-anonymous-usage-stats Send anonymous dbt usage statistics to dbt Labs.
-V, -v, --version Returns information about the installed dbt version
--version-check / --no-version-check Ensures or ignores that the installed dbt version matches the require-dbt-version specified in the dbt_project.yml file.
--warn-error If dbt would normally warn, instead raise an exception.
--warn-error-options WARN_ERROR_OPTIONS Allows for granular control over exactly which types of warnings are treated as errors. This argument receives a YAML string like '{"include": "all"}'.
--write-json / --no-write-json Whether or not to write the manifest.json and run_results.json files to the target directory.

dbt CLI flags (logging and debugging)
-d, --debug / --no-debug Display debug logging during dbt execution; useful for debugging and making bug reports. Not to be confused with the dbt debug command, which tests the database connection.
--log-cache-events / --no-log-cache-events Enable verbose logging for relational cache events to help when debugging.
--log-format [text|debug|json|default] Specify the format of logging to the console and the log file.
--log-format-file [text|debug|json|default] Specify the format of logging to the log file by overriding the default format
--log-level [debug|info|warn|error|none] Specify the severity of events that are logged to the console and the log file.
--log-level-file [debug|info|warn|error|none] Specify the severity of events that are logged to the log file by overriding the default log level
--log-path PATH Configure the 'log-path'. Overrides 'DBT_LOG_PATH' if it is set.
--print / --no-print Outputs or hides all {{ print() }} statements within a macro call.
--printer-width INTEGER Sets the number of characters for terminal output
-q, --quiet / --no-quiet Suppress all non-error logging to stdout. Does not affect {{ print() }} macro calls.
--use-colors / --no-use-colors Specify whether log output is colorized in the terminal
--use-colors-file / --no-use-colors-file Specify whether log file output is colorized

dbt CLI flags (parsing and performance)
--cache-selected-only / --no-cache-selected-only Have dbt cache or not cache metadata about all the objects in all the schemas where it might materialize resources
--partial-parse / --no-partial-parse Uses or ignores the pickle file in the target folder used to speed up dbt invocations by only reading and parsing modified objects.
--populate-cache / --no-populate-cache At start of run, use `show` or `information_schema` queries to populate a relational cache to speed up subsequent materializations
-r, --record-timing-info PATH Saves performance profiling information to a file that can be visualized with snakeviz to understand the performance of a dbt invocation
--static-parser / --no-static-parser Use or disable the static parser. (e.g. no partial parsing if enabled)
--use-experimental-parser / --no-use-experimental-parser Enable experimental parsing features.

As a managed dbt Core solution, the Datacoves platform simplifies the dbt Core experience and retains its inherent flexibility. It effectively bridges the gap, capturing many benefits of dbt Cloud while mitigating the challenges tied to a pure dbt Core setup. See if Datacoves dbt pricing is right for your organization or visit our product page.

Please contact us with any errors or suggestions.

Transformative power of data modeling in data
5 mins read

I read an article on Anchor Data Modeling, more specifically, Atomic modeling where the author proposes a different way of Data Modeling. The rationale for this change is that there is a lack of skills to model data well. We are giving powerful tools to novices, and that is bound to lead to problems.

From the article:

"we are in a distressful situation both concerning the art as a whole but also its place in modern architectures"

Is this the case? Do we have a big problem on the horizon that requires us to make this big shift?

I'd say I am open-minded and expose myself to different ways of thinking so I can broaden my views. A few years ago, I learned a bit about COBOL, not because I had any real use for it but because I was curious. I found it very interesting and even saw its similarities with SQL. I approached the topic with no preconceived ideas; this was the first time I had heard of Atomic Modeling.

Is there something wrong with atomic and anchor data modeling in general?

The issues I see with ideas like Atomic data modeling are not in their goal. I am 100% aligned with the goal; the problem is the technology, process, and people needed to get there.

What we see in the market is a direct result of a backlash against doing things "perfectly." But why is this the case? I believe it is because we haven't communicated how we will achieve the vision behind ideas like atomic data modeling. The author even includes a key phrase in the first paragraph:

"practitioners shying away from its complexity"

If doing anchor data modeling is "complex," how are we going to up-skill people? Is this feasible? I am happy if I can get more people to use SQL vs a GUI tool like Alteryx 😁

Doing anchor data modeling is complex

I am by no means an expert. Yet, I am fairly technical, and if I am not convinced, how will we convince decision-makers?

As I read this article, here's what I am thinking:

1. First, I will need to deconstruct the data I get from some source like material master data from SAP. That will be a bunch of tables, and who is going to do all this data modeling? It sounds expensive and time-consuming.

2. I am going to need some tooling for this, and I am either going to build it or use something a few others are using. Will my company want to take a chance on something this early? This sounds risky.

3. After I deconstruct all this data, I need to catalog all these atoms. I now have a lot of metadata, and that's good, but is the metadata business-friendly? We can't get people to add table descriptions; how is this going to happen with this explosion of objects? Who will maintain it? How will we expose this? Is there a catalog for it already? Does that catalog have the other features people need? It sounds like I need to figure out a bunch of things, the biggest one being the change management aspect.

4. What sort of database will I use to store this? This is a great use case for a graph database. But graph databases are not widely adopted, and I have seen graph databases choke at scale. We can use a relational database, but these joins are going to be complex.  Someone may have figured all this out, but there's more tech and learning needed. It sounds like this will also add time and expense.

5. When I have managed to do all the above, I will need to construct what people can actually use. We need objects that work with tools that are available. I need to make relational tables I can query with SQL and viz tools, which are more user-friendly. This sounds like more work, more time, and more money.

I may have missed some steps and oversimplified what's needed for this type of change. I am also aware that I may not know what exists to solve all the above. However, if I don't know it, then there are a ton of other people who also don't know it and this is where we need to start. We need to understand how we will tactically achieve this "better" world.

How we tactically achieve a better world

What are we fixing?

I've had conversations on metadata-driven automation, and like atomic modeling, I am not clear on who we are helping and how. What are we improving and in what timeframe? In the end, it feels like we have optimized for something only a few companies can do. To do anchor modeling well would be a huge expense, and when things go wrong, there are several points of failure. When we look at business problems, we need to be sure to optimize the end-to-end system. We can't locally optimize one area because we are likely moving the problem somewhere else. This can be in terms of time, money, or usability.

Decision-makers are not interested in data modeling. They are expecting results and a faster time to market. It's hard enough getting people to do things "better." This is why I find it hard to imagine that we can get to this level of maturity any time soon.

What can we do instead?

There are incremental steps we can take to incorporate best practices into the modern data stack. We need to help people mature their data practice faster, and we should not let perfection get in the way of good. Most companies are not large enterprises with millions of dollars to spend on initiatives like atomic modeling. That being said, I have yet to see anchor modeling in practice, so I welcome the opportunity to learn. I remember years ago the debates about how Ruby on Rails was teaching people "bad practices." The other side of that argument is that Rails helped companies like Twitter and GitHub launch businesses faster. Rails was also better than the alternative at the time, which included messy PHP code. Others advocated for well-crafted "scalable" and expensive Java applications. Rails may not be the powerhouse it once was, but it has had a huge influence on how we build software. I even see its influence in dbt, even if it might not have been intentional or direct.

Conclusion

Tools like Snowflake and dbt allow us to build processes that are much better than what most people have. Should we focus on all the "bad" things that may come with the modern data stack? Should we focus on how practitioners are not well educated, and so we need to throw all they are doing out?

I don't think so; I believe that we can help companies mature their data practices faster. Will we have the best data models? Maybe not. Will users do things perfectly? Nope. But can we help them move faster and guide them along their journey to avoid big pitfalls? I think we can. Getting people to use git, automating testing, and creating DataOps processes is a huge step forward for many organizations. Let's start there.

There's a reason Data Mesh and the Modern Data Stack resonate with so many people. There's a desire to do things faster with more autonomy at many companies, not just the ones with multi-million-dollar budgets. Let's focus on what is achievable, do the best we can, and help people mature along the way. We don't need more complexity; we need less.

Team work on data stack
Getting started with dbt and snowflake
5 mins read

Using dbt with Snowflake is one of the most popular and powerful data stacks today. Are you facing one of these situations?

- You just joined a data team using dbt Core with Snowflake, and you want to set up your local dbt development environment

- Your company is already using Snowflake and wants to try out dbt.

- You are using dbt with another data warehouse and you want to try dbt with Snowflake before migrating.

If you are facing any of these, this guide will help you set up your dbt environment for Snowflake using VS Code as your dbt development environment. We will also cover useful Python libraries and VS Code extensions that will make you more productive. Want to become a dbt on Snowflake ninja? Keep reading.

Pre-requisites for dbt development on Snowflake

To get started, you need to set up your dbt development environment. This includes Python, dbt, and VS Code. You will also need to set up your Snowflake account properly so dbt can work its magic.

While dbt supports Python versions greater than 3.7, some other tools like Snowpark require Python 3.8, so we recommend sticking with that version. You can find the installer for your particular operating system on this page. Finding the right Windows installer can be confusing, so look for the link titled Windows installer (64-bit). If you are on a Mac, use the macOS 64-bit universal2 installer.

When using dbt, you will also need Git. To set up Git, follow this handy guide.

The preferred IDE for dbt is VS Code; you can find it on the official Microsoft site. Click the big Download button and install it like any other app.

Installing dbt is done using pip. You can find more information on this page.

For dbt Snowflake, simply run:

pip install dbt-snowflake

This will install the latest version of dbt Core along with the dbt adapter for Snowflake. If you need to use an older version of dbt, you will need to specify it when you run pip. Note that the version of the dbt adapter may not match the version of dbt Core.

For example, as of this writing, the latest version of dbt Core 1.3.x is 1.3.4, while the latest dbt-snowflake for 1.3.x is 1.3.2. So, if you want to install the latest dbt 1.3.x for Snowflake, you would run:

pip install dbt-snowflake==1.3.2

This will install dbt-snowflake 1.3.2 along with dbt-core 1.3.4.

Configure Snowflake permissions

The final pre-requisite you will need to do is set up a Snowflake user that has been granted a role with the right access to a database where dbt will create views and tables. The role should also have access to a Snowflake Warehouse for compute. Here is a handy guide that gives you the basics. We would recommend a more comprehensive setup for a production deployment, but this will get you going for now.

The key items not covered in that guide are that you should create a role ANALYST and a database ANALYTICS_DEV, grant OWNERSHIP of that database to the ANALYST role, and grant the ANALYST role USAGE on the TRANSFORMING warehouse. You also need to grant the ANALYST role to your specific user. Don’t run as ACCOUNTADMIN.

This is all needed because when dbt runs it will create a schema for your user in the ANALYTICS_DEV database and you will use the TRANSFORMING warehouse when compute is needed like when dbt creates tables.
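As a rough sketch, the setup described above could look like the following SQL, run with a sufficiently privileged role; the object names match this guide, and the user name is a placeholder:

```sql
-- Run with a sufficiently privileged role (e.g. SYSADMIN/SECURITYADMIN)
CREATE ROLE IF NOT EXISTS ANALYST;
CREATE DATABASE IF NOT EXISTS ANALYTICS_DEV;

-- The ANALYST role owns the dev database so dbt can create schemas in it
GRANT OWNERSHIP ON DATABASE ANALYTICS_DEV TO ROLE ANALYST;

-- The ANALYST role needs compute for dbt runs
GRANT USAGE ON WAREHOUSE TRANSFORMING TO ROLE ANALYST;

-- Finally, give the role to your developer user (placeholder name)
GRANT ROLE ANALYST TO USER my_user;
```

A production setup would add separate databases and roles per environment, but this is enough for local development.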

If all of this seems confusing or tedious, you should consider a managed dbt solution like dbt Cloud or Datacoves. For more information, checkout our in-depth article where we compare dbt cloud vs dbt core as well as managed dbt core.

Now with all the pre-requisites out of the way, let’s configure dbt to connect with Snowflake.

Configure your dbt Snowflake profile

To initialize your dbt project in Snowflake, dbt has a handy command: dbt init. You can configure your dbt Snowflake profile using dbt init for both a new and an existing dbt project. First, clone your repo using git. Then simply run dbt init and go through the prompts.

Once you get your project set up, consider adding a profile_template.yml to your project. As stated on that page, using a profiles template will simplify the dbt init process for users on your team.

dbt init

To make sure dbt can connect to Snowflake, run dbt debug. If dbt can connect to your Snowflake account, you should see “All checks passed!” If you have problems, join the dbt Community, search the forums, or ask a question in the #db-snowflake channel.

dbt debug used to check dbt to Snowflake connection

Even though dbt set up your profiles.yml to connect to Snowflake with your credentials, it only provides the basic setup. This page lists additional parameters that you can configure for the Snowflake connection in your profiles.yml file.

If you want to configure those parameters, you will need to open and edit the profiles.yml file. The profiles.yml file created by dbt init will be in your user’s home directory in a subdirectory called .dbt.

One handy configuration parameter to change is reuse_connections. Also, if you use SSO authentication with an external browser, you should consider setting up connection caching on Snowflake; otherwise you will be prompted to authenticate for every connection dbt opens to the database.
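For reference, a minimal Snowflake profiles.yml might look like the sketch below; the profile name must match the profile key in your dbt_project.yml, and the account, user, and object names are placeholders:

```yaml
# ~/.dbt/profiles.yml -- account, user, and object names are placeholders
my_project:                  # must match the "profile" in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: snowflake
      account: my_account.us-east-1     # your Snowflake account identifier
      user: my_user
      authenticator: externalbrowser    # for SSO; use password auth otherwise
      role: ANALYST
      database: ANALYTICS_DEV
      warehouse: TRANSFORMING
      schema: dbt_my_user               # your personal dev schema
      threads: 8
      reuse_connections: True
```

After editing, rerun dbt debug to confirm the connection still passes all checks.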

Now that you have set up your dbt connection to Snowflake, there are some other options you can configure when dbt runs against your Snowflake account. This includes overriding the default warehouse for a specific model, adding query tags, copying grants, etc. This handy page has a lot more information on these dbt snowflake advanced configurations.

Improve your dbt Snowflake experience with SQLFluff and other Python libraries

Now that you have dbt connecting to your database, let’s talk about some python libraries you should set up to improve how you work with dbt.

dbt-coves

dbt-coves is an open source library created by Datacoves to complement dbt development by simplifying tedious tasks like generating staging models. It is a must-have tool for any dbt practitioner who wants to improve their efficiency. dbt-coves will automatically create your source.yml and staging models as well as their corresponding yml(property) files. It also has utilities for backing up Airbyte and Fivetran configurations.

SQLFluff

SQLFluff is a Python library for linting SQL code. SQLFluff integrates seamlessly with dbt using a templater, and it is the only linter compatible with dbt. If you have not heard of code linting, it helps you enforce rules on how your SQL code is formatted: for example, whether everyone uses leading or trailing commas, or whether SQL keywords are upper or lower case. We recommend everyone use a linter, as this will improve code readability and long-term maintainability.
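As an illustration, a minimal .sqlfluff configuration for a dbt project on Snowflake might look like this (the dialect and project path are assumptions about your setup):

```ini
# .sqlfluff -- minimal example for a dbt + Snowflake project
[sqlfluff]
dialect = snowflake
templater = dbt

[sqlfluff:templater:dbt]
project_dir = ./
```

With this file in the repo root, `sqlfluff lint models/` checks your SQL and `sqlfluff fix models/` applies auto-fixes where it can.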

pre-commit with dbt-checkpoint

dbt-checkpoint is a tool that lets you make sure your dbt project complies with certain governance rules. For example, you can have a rule that validates whether every dbt model has a description. You can also ensure that every column is documented, among many other rules. We recommend using dbt-checkpoint, as it will ensure developers don’t add technical debt from the start of a project.

In addition to these Python libraries, at Datacoves we set up the development environment with other handy libraries like Snowpark and Streamlit. We believe that flexibility is important especially in enterprise environments. If you want to learn what to consider when selecting a managed dbt core platform, check out our guide.

Improve your dbt Snowflake experience with dbt power user and other VS Code extensions

In addition to Python libraries, you can improve your dbt workflow with Snowflake by installing these VS Code extensions.

Snowflake VS Code Extension

The official Snowflake VS Code extension keeps you in the flow by bringing the Snowflake interface into VS Code. With it you can explore your database, run queries, and even upload and download files to Snowflake stages. It is a must-have for any Snowflake user.

dbt power user

dbt power user is a VS Code extension that improves the dbt development experience by adding handy shortcuts to run dbt models and tests, and it even lets you preview the result of a dbt model or a CTE within that model.

SQLFluff VS Code extension

The SQLFluff VS Code extension is the companion to the SQLFluff python library. It improves the development experience by highlighting linting errors right in line with your SQL code. It even has a handy hover which describes the linting error and links to the SQLFluff documentation.

SQLFluff Linting Error hover on VS Code

There are many other great VS Code extensions, and at Datacoves we are always improving the dbt developer’s experience by pre-installing them. One recent addition, demonstrated in the video below, is a ChatGPT extension that improves the dbt experience by writing documentation in addition to other functions.

Conclusion

Getting started with dbt and Snowflake is not hard, and knowing how to streamline the development experience when working with them will maximize developer happiness.

Some users may run into issues configuring their development environment. If that happens, check out the #sqlfluff, #tools-dbt-libraries, and #tools-vscode channels on the dbt Slack community. There are many helpful people there always ready to help.

As you can see there are a lot of steps and potential gotchas to get a dbt environment properly configured. This gets more complicated as the number of dbt users increases. Upgrading everyone can also pose a challenge. These reasons and more are why we created the most flexible managed dbt-core environment. If you want your developers to be up and running in minutes with no installation required, reach out and we can show you how we can streamline your teams’ dbt experience with best practices from the start.

Datacoves joins the TinySeed family
5 mins read

Genesis of Datacoves and beyond

A few years ago, we identified a need to simplify how companies implement the Modern Data Stack, both from an infrastructure and process perspective. This vision led to the creation of Datacoves. Our efforts were recognized by Snowflake as we became a top-10 finalist in the 2022 Snowflake Startup Challenge. We have become trusted partners at Fortune 100 companies, and now we are ready for our next chapter.

Datacoves is thrilled to announce that we have been selected to be part of the Spring 2023 cohort at TinySeed. Their philosophy and culture align flawlessly with our values and aspirations. Our goal has always been to help our customers revolutionize the way their teams work with data, and this investment will enable us to continue delivering on that vision.

The TinySeed investment will also strengthen our commitment to supporting open-source projects within the dbt community, such as SQLFluff, dbt-checkpoint, dbt-expectations, and dbt-coves. Our contributions will not only benefit our customers but also the broader dbt community, as we believe open-source innovation enhances the tools and resources available to everyone.

We want to express our deep gratitude to Tristan Handy, CEO & Founder at dbt Labs and the entire dbt Labs team for partnering with us on the 2023 State of Analytics Engineering Community survey and for creating the amazing dbt framework. Without dbt, Datacoves would not be where it is today, and our customers would not enjoy the benefits that come with our solutions.

As we continue to grow and evolve, we look forward to finding new ways to collaborate with dbt Labs and the dbt community to further enhance how people use dbt and provide the best possible experience for everyone. Together, we can empower data teams around the world and revolutionize the way they work.

TinySeed group picture
TinySeed 2023 Spring Cohort after some Rage Room fun

Dataops top tools
5 mins read

In continuation of our previous blog discussing the importance of implementing DataOps, we now turn our attention to the tools that can efficiently streamline your processes. Additionally, we will explore real-life examples of successful implementations, illustrating the tangible benefits of adopting DataOps practices.

Which DataOps tools can help streamline your processes?

There are a lot of DataOps tools that can help you automate data processes, manage data pipelines, and ensure the quality of your data. These tools can help data teams work faster, make fewer mistakes, and deliver data products more quickly.

Here are some recommended tools needed for a robust DataOps process:

  1. dbt (data build tool): dbt is an open-source data transformation tool that lets teams use SQL to transform the data in their warehouse. dbt has a lightweight modeling layer and features like dependency management, testing, and the ability to create documentation. Since dbt uses version-controlled code, it is easy to see changes to data transformations (code reviews) before they are put into production. dbt can dynamically change the target of a query's "FROM" statement on the fly, which allows us to run the same code against development, test, and production databases by changing a configuration. During the CI process, dbt also lets us run only the changed transformations and their downstream dependencies.
  2. Fivetran: Fivetran is an Extract and Load (EL) tool that has been gaining in popularity in recent years since it removes the complexity of writing and maintaining custom scripts to extract data from SaaS tools like Salesforce and Google Analytics. By automating data extraction from hundreds of sources, Fivetran frees data engineers to work on projects with a bigger impact. Finally, Fivetran has a robust API which allows you to save configurations done via their UI for disaster recovery or to promote configurations from a development to a production environment.
  3. Airbyte: Airbyte is another data-ingestion EL tool that is appealing because it is open source, requires little or no code, and is run by the community. It also makes it easier to extract and load data without having to do custom coding.  Airbyte also offers a connector development kit to help companies build custom connectors that may not be available. This allows companies to leverage most of the Airbyte functionality without too much work. There's also an API that can be used to retrieve configurations for disaster recovery.
  4. SQLFluff: SQLFluff is an open-source SQL linter that helps teams make sure their SQL code is consistent and follows best practices. It gives you a set of rules that you can change to find and fix syntax errors, inconsistencies in style, and other common mistakes. SQLFluff can be added to the CI/CD pipeline so that problems are automatically found before they are added to the codebase. By using a tool like SQLFluff, you can make sure your team follows a consistent coding style, and this will help with long-term project maintainability.
  5. dbt-checkpoint: dbt-checkpoint provides validators that enforce quality standards in dbt projects. dbt is great, but once a project accumulates many models, sources, and macros, it becomes hard for all the data analysts and analytics engineers to maintain the same level of quality. Users forget to update the columns in property (yml) files or to add descriptions for tables and columns. Without automation, reviewers have to do more work and may miss unintentional mistakes. With dbt-checkpoint, organizations can add automated validations that improve the code review and release process.
  6. Hashboard: Hashboard is a business intelligence (BI) product built for data engineers to do their best work and easily spread the data love to their entire organizations. Hashboard has an interactive data exploration tool that enables anyone in an organization to discover actionable insights.
  7. GitHub: GitHub offers a cloud-based Git repository hosting service. It makes it easier for people and teams to use Git for version control and to work together. GitHub can also run the workflows needed for CI/CD and it provides a simple UI for teams to perform code reviews and allows for approvals before code is moved to production.
  8. Docker: Docker makes it easy for data teams to manage dependencies such as the versions of libraries like dbt, dbt-checkpoint, and SQLFluff. By packaging these dependencies together, Docker makes development workflows more robust and simplifies reproducibility.
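To make point 1 concrete, here is a minimal Python sketch of the target-swapping idea behind dbt. This is an illustration only, not dbt's actual implementation: in a real project the targets live in profiles.yml and references are resolved by the ref() macro; the database and schema names below are hypothetical.

```python
import re

# Hypothetical target configurations, analogous to dbt profile targets.
TARGETS = {
    "dev":  {"database": "analytics_dev",  "schema": "dbt_jane"},
    "prod": {"database": "analytics_prod", "schema": "marts"},
}

def render_ref(model_name: str, target: str) -> str:
    """Resolve a model reference to a fully qualified name for one target."""
    t = TARGETS[target]
    return f'{t["database"]}.{t["schema"]}.{model_name}'

def render_model(sql_template: str, target: str) -> str:
    """Replace ref('model') placeholders with target-specific names."""
    return re.sub(
        r"ref\('([^']+)'\)",
        lambda m: render_ref(m.group(1), target),
        sql_template,
    )

model_sql = "SELECT * FROM ref('stg_orders')"
print(render_model(model_sql, "dev"))   # SELECT * FROM analytics_dev.dbt_jane.stg_orders
print(render_model(model_sql, "prod"))  # SELECT * FROM analytics_prod.marts.stg_orders
```

The same model code runs against different environments purely by switching which target configuration is active, which is what makes promoting transformations from development to production safe and repeatable.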

Examples of companies who have successfully implemented DataOps

DataOps in Top Companies

DataOps has been successfully used in the real world by companies of all sizes, from small startups to large corporations. The DataOps methodology is based on collaboration, automation, and monitoring throughout the entire data lifecycle, from collecting data to using it. Organizations can get insights faster, be more productive, and improve the quality of their data. DataOps has been used successfully in many industries, including finance, healthcare, retail, and technology.

Here are a few examples of real-world organizations that have used DataOps well:

  1. Optum: Optum, part of UnitedHealthcare, prioritizes healthcare data management and analytics, and when it needed to implement new features and apps quickly, it turned to DataOps. Its biggest challenge was managing data from dozens of sources via thousands of APIs. DataOps helped Optum break down silos, and a massive standardization and modernization effort created a scalable, centralized data platform that seamlessly shares information across multiple consumers, saving millions of dollars annually by reducing compute usage.
  2. JetBlue: DataOps helped JetBlue make data-driven decisions. After struggling with an on-premises data warehouse, the airline migrated to the cloud to enable self-service reporting and machine learning. They've cleaned, organized, and standardized their data and leveraged DataOps to create robust processes. Their agility in data curation has enabled them to increase data science initiatives.
  3. HubSpot: HubSpot is a leading company that makes software for inbound marketing and sales. It used DataOps to improve the use of its data. By using a DataOps approach, HubSpot was empowered to do data modeling the right way, to define model dependencies, and to update and troubleshoot models, which resulted in a highly scalable database and opened up new data application possibilities.
  4. Nasdaq: Nasdaq, a global technology company that provides trading, clearing, and exchange technology, adopted DataOps to improve its data processing and analysis capabilities. They launched a data warehouse, products, and marketplaces quickly. After scalability issues, they moved to a data lake and optimized their data infrastructure 6x. The migration reduced maintenance costs and enabled analytics, ETL, reporting, and data visualization. This enabled better and faster business opportunity analysis.  
  5. Monzo: Monzo is a UK-based digital bank that used DataOps to create a data-driven culture and improve its customer experience. By letting everyone create and explore data lineage maps, they help teams understand how their changes affect the different layers of the data warehouse. This gives the Monzo data team confidence that the data they deliver to end users is correct.

What is the future of DataOps adoption?

Future of DataOps

DataOps has a bright future as more businesses realize how important data is to their success. With the exponential growth of data, managing it well is becoming ever more critical, and more companies are likely to adopt DataOps as they try to streamline their data management processes and cut costs. Cloud-based data management platforms have made it easier for organizations to manage their data well; their main benefits are scalability, flexibility, and cost-effectiveness. With DataOps, teams can improve collaboration and agility and build trust in data by creating processes that test changes before they are rolled out to production.

With the development of modern data tools, companies can now adopt software development best practices in analytics. In today’s fast-paced world, it's important to give teams the tools they need to respond quickly to changes in the market by using high-quality data. Companies should use DataOps if they want to manage data better and reduce the technical debt created by uncontrolled processes. Putting DataOps processes in place for the first time can be hard, and it's easier said than done. DataOps requires a change in attitude, a willingness to try out new technologies and ways of doing things, and a commitment to continuous improvement. If an organization is serious about using DataOps, it must invest in the training, infrastructure, and cultural changes needed to make it work. With the right approach, companies can get the most out of DataOps and help their businesses deliver better outcomes.

At Datacoves, we offer a suite of DataOps tools to help organizations implement DataOps quickly and efficiently. We enable organizations to start automating simple processes and gradually build out more complex ones as their needs evolve. Our team has extensive experience guiding organizations through the DataOps implementation process.

Schedule a call with us, and we'll explain how dbt and DataOps can help you mature your data processes.

Implement DataOps
5 mins read

The amount of data produced today is mind-boggling and expanding at an astonishing rate. This explosion of data has led to big data challenges that organizations must overcome to remain competitive. Organizations must manage data effectively and guarantee its accuracy and quality in order to derive actionable insights from it. This is where DataOps comes in. In this first post we give an overview of DataOps; in our next post we will discuss tooling that supports DataOps, as well as companies that have seen the benefit of implementing DataOps processes.

What is DataOps?

DataOps is a methodology that merges DevOps principles from software development with data management practices. This helps to improve the process of developing, deploying, and maintaining data-driven applications. The goal is to increase the delivery of new insights and the approach emphasizes collaboration between data scientists, data engineers, data analysts, and analytics engineers.

Automation and continuous delivery are leveraged for faster deployment cycles. Organizations can enhance their data management procedures, decrease errors, and increase the accuracy and timeliness of their data by implementing DataOps.

The overarching goal is to improve an organization's capacity for decision-making by facilitating quicker, more precise data product deliveries. DataOps can speed up task completion with fewer errors by encouraging effective team collaboration, ultimately enhancing the organization's success.

DataOps is now feasible thanks to modern data stack tools. Modern data tools like Snowflake and dbt can enable delivery of new features more quickly while upholding IT governance standards. This allows for employee empowerment and improves their agility all while helping to create a culture that values data-driven decision-making. Effective DataOps implementation enables businesses to maximize the value of their data and build a data-driven culture for greater success.

The importance of DataOps in solving big data challenges

Importance of DataOps

It is impossible to overstate the role that DataOps plays in overcoming big data challenges. As data volumes continue to grow, organizations must learn to manage data effectively if they want to stay competitive.

DataOps addresses these challenges by providing a set of best practices and tools that enable data teams to work more efficiently and effectively. Some of the key benefits of DataOps include:

  1. Improved agility and flexibility: DataOps empowers data teams to work in an agile manner, enabling them to quickly adapt to shifting requirements and business needs. By basing decisions on up-to-the-minute information, businesses can stay one step ahead of the competition.
  2. Increased collaboration: DataOps encourages collaboration between different teams involved in the data management process, including data engineers, data scientists, and business analysts. This leads to better communication, faster feedback loops, and more efficient workflows.
  3. Enhanced quality and accuracy: DataOps lowers the risk of errors and inconsistencies in the data by automating the processing and analysis pipelines behind data deliverables. As a result, businesses are better able to make decisions based on accurate and trustworthy data.
  4. Reduced costs: DataOps helps businesses save money by reducing the time and resources needed to manage and perform data analytics. As a result, they can invest in other aspects of their business to maintain their competitiveness.

In a world without DataOps, data-related processes can be slow, error-prone, and documentation is often outdated. DataOps provides the opportunity to rethink how data products are delivered. This promotes a culture of federated delivery. Everyone in the organization with the necessary skills is empowered to participate in insight generation.

We can achieve comparable results while preserving governance over the delivery process by learning from large, distributed open-source software development projects like the Linux project. To ensure high-quality output, a good DataOps process places an emphasis on rules and procedures that are monitored and controlled by a core team.

What are the components for successful DataOps implementation?

Successful DataOps implementation

To implement DataOps well, one needs to take a holistic approach that brings together people, processes, and technology. Here are the five most important parts of DataOps that organizations need to think about in order to implement it successfully:

  1. Collaborative Culture: Organizations must foster a collaborative culture: encourage cross-functional teamwork and make it easy for data scientists, data engineers, data analysts, and analytics engineers to talk to each other and have visibility into each other's work. Collaboration breaks down silos and makes it easier for everyone to understand how data-driven applications work. A collaborative culture also helps team members feel ownership of and responsibility for their work, which improves the quality of the data and of the applications that use it.
  2. Automation: Automation is a key part of DataOps because it helps organizations streamline their feature deliveries and cut down on mistakes. Data ingestion, data integration, and data validation are all examples of tasks that are done over and over again. By automating these tasks, organizations can free up their resources to focus on more complex and valuable tasks, like analyzing data and deriving insights from it.
  3. Continuous Integration and Delivery: Continuous integration and delivery (CI/CD) are important parts of DataOps. Together they enable the deployment of working code to production in a continuous and iterative way. CI/CD can help speed up the time it takes to get data-driven apps to market so companies can respond quickly to changing business needs and provide more value to their customers.
  4. Monitoring and Feedback: Monitoring and feedback are important parts of DataOps because they help make sure that data-driven applications work as they should. By keeping an eye on how well data-driven applications work, organizations can catch problems early and fix them as needed. By asking users for feedback, the code can be made more reliable and maintainable, leading to better outcomes.
  5. Data Governance: Data governance is a set of policies, procedures, and controls that make sure data is managed correctly throughout its lifecycle. Data governance includes data ownership, data quality, data security, data privacy, and data lineage. By putting in place strong data governance practices, organizations can make sure that their data is correct, consistent, and in line with company standards. Data governance can help make the data more trustworthy and lower the chance of a data breach.

Having identified what needs to be done to achieve a good DataOps implementation, you may be wondering how we turn these ideas into actionable steps.

What are the best ways to implement DataOps in organizations?

Best ways to implement DataOps in organizations
  1. Start Small and Grow: When putting DataOps in place, it's important to start small and grow gradually. This approach lets companies try out and improve their DataOps processes before applying them to bigger projects. By starting small, organizations can find potential problems and solve them before rolling processes out to a larger audience.
  2. Align DataOps with Business Goals: For an organization's DataOps implementation to work, its DataOps processes need to be in line with its business goals. Organizations should figure out what their most important business goals are for their data-driven applications, and then design their DataOps processes to help them reach these goals. This approach makes sure that DataOps is focused on adding value to the business and not just on implementing new technology.
  3. Foster collaboration between teams: One of the most important parts of DataOps is getting teams to work together. Organizations need to help data scientists, data engineers, data analysts, and analytics engineers work together. This method breaks down silos and makes it easier for everyone to understand how data-driven applications work. Organizations can make it easier for people to work together by putting them on cross-functional teams, encouraging open communication, fostering transparency, and where possible encouraging the same tools, e.g. using dbt for data pipelines.
  4. Invest in automation tools: Automation is a key part of DataOps. Organizations should invest in automation tools that can help them speed up their data management processes, such as data ingestion, data integration, and data validation. Automation can help cut down on mistakes, increase productivity, and free up resources for more complex tasks, like analyzing data.
  5. Implement CI/CD processes: Creating a robust feature release process with automated validations and code review will encourage collaboration and build transparency in the process. Users will be encouraged to move faster by removing the fear that a new feature will break production.
  6. Monitor Performance and Get Feedback: Two important parts of DataOps are monitoring performance and getting feedback. Organizations should use monitoring tools that track how well data-driven applications are working and gather feedback from users. This helps find problems early and makes the data more accurate and reliable. It also gives organizations useful insight into how users leverage their tools, which can help improve the user experience.
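Step 5 above can be sketched as a small CI gate: run automated checks over changed files and return a nonzero exit code when any check fails, so a broken feature never reaches production. The checks here are simplified stand-ins for real tools like SQLFluff or dbt-checkpoint, and all names are illustrative.

```python
def check_no_select_star(sql: str) -> list[str]:
    """Flag SELECT * so production models declare explicit columns."""
    return ["avoid SELECT *"] if "select *" in sql.lower() else []

def check_no_leftover_limit(sql: str) -> list[str]:
    """Flag LIMIT clauses, which are usually debugging leftovers."""
    return ["remove debugging LIMIT"] if " limit " in f" {sql.lower()} " else []

def ci_gate(changed_files: dict[str, str]) -> int:
    """Run every check over every changed file; 0 = pass, 1 = fail (CI exit code)."""
    failures = [
        f"{path}: {msg}"
        for path, sql in changed_files.items()
        for check in (check_no_select_star, check_no_leftover_limit)
        for msg in check(sql)
    ]
    for failure in failures:
        print(failure)
    return 1 if failures else 0

print(ci_gate({"models/stg_orders.sql": "select * from raw.orders limit 10"}))  # 1
print(ci_gate({"models/customers.sql": "select id, name from raw.customers"}))  # 0
```

In practice the same pattern runs inside GitHub workflows: the pipeline executes the checks on every pull request, and a nonzero exit code blocks the merge until the problems are fixed.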

Looking to implement DataOps quickly and efficiently?  

At Datacoves, we offer a comprehensive suite of DataOps tools to help organizations implement robust processes quickly and efficiently. One of the key benefits of using Datacoves is the ability to implement DataOps incrementally using our dynamic Infrastructure. Organizations can start by automating simple processes, such as data collection and cleaning, and gradually build out more complex processes as their needs evolve.

Our agile methodology allows for iterative development and continuous deployment, ensuring that our tools and processes are tailored to meet the evolving needs of our clients. This approach allows organizations to see the benefits of DataOps quickly, without committing to a full-scale implementation upfront. We have helped teams implement mature DataOps processes in as little as six weeks.

The Datacoves platform is being used by top global organizations and our team has extensive experience in guiding organizations through the DataOps implementation process. We ensure that you benefit from best practices and avoid common pitfalls when implementing DataOps.

Set aside some time to speak with us to discuss how we can support you in maturing your data processes with dbt and DataOps.

Don't miss out on our latest blog, where we showcase the best tools and provide real-life examples of successful implementations.

Get ready to take your data operations to the next level!
