Mayra Peña

Data Engineering | Technical Customer Success.
Solving enterprise data challenges quickly with dbt & Airflow.

What is Microsoft Fabric
5 mins read

There's a lot of buzz around Microsoft Fabric these days. Some people are all-in, singing its praises from the rooftops, while others are more skeptical, waving the "buyer beware" flag. After talking with the community and observing Fabric in action, we're leaning toward caution. Why? Well, like many things in the Microsoft ecosystem, it's a jack of all trades but a master of none. Many of the promises seem to be more marketing hype than substance, leaving you with "marketecture" instead of solid architecture. While the product has admirable, lofty goals, Microsoft has many wrinkles to iron out.

In this article, we'll dive into 10 reasons why Microsoft Fabric might not be the best fit for your organization in 2025. By examining both the promises and the current realities of Microsoft Fabric, we hope to equip you with the information needed to make an informed decision about its adoption.

What is Microsoft Fabric?

Microsoft Fabric is marketed as a unified, cloud-based data platform developed to streamline data management and analytics within organizations. Its goal is to integrate various Microsoft services into a single environment and to centralize and simplify data operations.

This means that Microsoft Fabric is positioning itself as an all-in-one analytics platform designed to handle a wide range of data-related tasks: data engineering, data integration, data warehousing, data science, real-time analytics, and business intelligence. A one-stop shop, if you will. By consolidating these functions, Fabric hopes to provide a seamless experience for organizations to manage, analyze, and gather insights from their data.

Core Components of Microsoft Fabric

  • OneLake: OneLake is the foundation of Microsoft Fabric, serving as a unified data lake that centralizes storage across Fabric services. It is built on the Delta Lake table format (comparable to Apache Iceberg for large-scale cloud data management) and leverages Azure Blob Storage.
  • Synapse Data Warehouse: Similar to Amazon Redshift, this provides storage and management for structured data. It supports SQL-based querying and analytics, aiming to facilitate data warehousing needs.
  • Synapse Data Engineering: A compute engine based on Apache Spark, similar to Databricks' offering, intended to support tasks such as data cleaning, transformation, and feature engineering.
  • Azure Data Factory: A tool for pipeline orchestration and data loading, which is also part of Synapse Data Engineering.
  • Synapse Data Science: Jupyter-style notebooks that run only on Azure Spark. It is designed to support data scientists in developing predictive analytics and AI solutions by leveraging Azure ML and Azure Spark services.
  • Synapse Real-Time Analytics: Enables the analysis of streaming data from various sources including Kafka, Kinesis, and CDC sources.
  • Power BI: A business intelligence (BI) tool, comparable to Tableau, designed to create data visualizations and dashboards.
Image from: https://learn.microsoft.com/en-us/fabric/

Fabric presents itself as an all-in-one solution, but is it really? Let’s break down where the marketing meets reality.

10 Reasons It’s Still Not the Right Choice in 2025

While Microsoft positions Fabric as an innovative step forward, much of it is clever marketing and repackaging of existing tools. Here’s what’s claimed—and the reality behind these claims:

1. Fragmented User Experience, Not True Unification

Claim: Fabric combines multiple services into a seamless platform, aiming to unify and simplify workflows, reduce tool sprawl, and make collaboration easier with a one-stop shop.

Reality:

  • Rebranded Existing Services: Fabric is mainly repackaging existing Azure services under a new brand. For example, Fabric bundles Azure Data Factory (ADF) for pipeline orchestration, Azure Synapse Analytics for traditional data warehousing needs, and Azure Spark for distributed workloads. While there are some enhancements to Synapse to synchronize data from OneLake, the core functionalities remain largely unchanged. Power BI is also part of Fabric, and this tool has existed for years, as have notebooks under the Synapse Data Science umbrella.
  • Steep Learning Curve and Complexity: Fabric claims to create a unified experience that doesn’t exist in other platforms, but it simply bundles a wide range of services—from data engineering to analytics—and introduces new concepts (like KQL, a proprietary query language used only in the Azure ecosystem). Some tools are geared toward different user personas, such as ADF for data engineers and Power BI for business analysts, but to “connect” an end-to-end process, users need to interact with several different tools. This can be overwhelming, particularly for teams without deep Microsoft expertise. Each tool has its own quirks, and even services with overlapping functionality don’t behave the same way when doing the same thing. This complicates the learning process and reduces overall efficiency.

2. Performance Bottlenecks & Throttling Issues

Claim: Fabric offers a scalable and flexible platform.

Reality: In practice, managing scalability in Fabric can be difficult. Scaling isn’t a one‑click, all‑services solution—instead, it requires dedicated administrative intervention. For example, you often have to manually pause and unpause capacity to save money, a process that is far from ideal if you’re aiming for automation. Although there are ways to automate these operations, setting up such automation is not straightforward. Additionally, scaling isn’t uniform across the board; each service or component must be configured individually, meaning that you must treat them on a case‑by‑case basis. This makes the promise of scalability and flexibility a challenge to realize without significant administrative overhead.

3. Capacity-Based Pricing Creates Cost Uncertainty

Claim: Fabric offers predictable, cost-effective pricing.

Reality: While Fabric's pricing structure appears straightforward, several hidden costs and adoption challenges can impact overall expenses and efficiency:  

  • Cost uncertainty: Microsoft Fabric uses a capacity-based pricing model that requires organizations to purchase predefined Capacity Units (CUs). Organizations need to carefully assess their workload requirements to optimize resource allocation and control expenses. Although a pay-as-you-go (PAYG) option is available, it often demands manual intervention or additional automation to adjust resources dynamically. This means organizations often need to overprovision compute power to avoid throttling, leading to inefficiencies and increased costs. The problem is that you pay for what you think you will use in exchange for roughly a 40% discount; if you don’t use all of that capacity, it is wasted. If you exceed capacity, you can configure PAYG, but at full price. Unlike true serverless solutions, you pay for allocated capacity regardless of actual usage. This isn’t flexible like the cloud was intended to be. 👎
  • Throttling and Performance Degradation: Exceeding purchased capacity can result in throttling, causing degraded performance. To prevent this, organizations might feel compelled to purchase higher capacity tiers, further escalating costs.
  • Visibility and Cost Management: Users have reported challenges in monitoring and predicting costs due to limited visibility into additional expenses. This lack of transparency necessitates careful monitoring and manual intervention to manage budgets effectively.  
  • Adoption and Training Time: It’s important to note that implementing Fabric requires a significant time investment in training and adapting existing workflows. While this is the case with any new platform, Microsoft is notorious for complexity in its tooling, and this can lead to longer adoption periods, during which productivity may temporarily decline.

All this to say that the pricing model is not good unless you can predict with great accuracy exactly how much you will spend every single day, and who knows that? Check out this article on the hidden costs of Fabric, which goes into detail with cost comparisons.

4. Limited Compatibility with Non-Microsoft Tools

Claim: Fabric supports a wide range of data tools and integrations.

Reality: Fabric is built around tight integration with other Fabric services and Microsoft tools such as Office 365 and Power BI, making it less ideal for organizations that prefer a “best‑of‑breed” approach or rely on tools like Tableau, Looker, open-source solutions like Lightdash, or other non‑Microsoft solutions. This tight coupling can severely limit flexibility and complicate future migrations.

While third-party connections are possible, they don’t integrate as smoothly as those in the MS ecosystem like Power BI, potentially forcing organizations to switch tools just to make Fabric work.

5. Poor DataOps & CI/CD Support

Claim: Fabric simplifies automation and deployment for data teams by supporting modern DataOps workflows.

Reality: Despite some scripting support, many components remain heavily UI‑driven. This hinders full automation and integration with established best practices for CI/CD pipelines (e.g., using Terraform, dbt, or Airflow). Organizations that want to mature their data operations with agile DataOps practices find themselves forced into manual workarounds and struggle to integrate Fabric tools into their CI/CD processes. Unlike tools such as dbt, there is no built-in data quality or unit testing, so additional tools would need to be added to Fabric to achieve this functionality.

6. Security Gaps & Compliance Risks

Claim: Microsoft Fabric provides enterprise-grade security, compliance, and governance features.

Reality: While Microsoft Fabric offers robust security measures like data encryption, role-based access control, and compliance with various regulatory standards, there are some concerns organizations should consider.

One major complaint is that access permissions do not always persist consistently across Fabric services, leading to unintended data exposure.

For example, users can still retrieve restricted data from reports due to how Fabric handles permissions at the semantic model level. Even when specific data is excluded from a report, built-in features may allow users to access the data, creating compliance risks and potential unauthorized access. Read more: Zenity - Inherent Data Leakage in Microsoft Fabric.

While some of these security risks can be mitigated, they require additional configurations and ongoing monitoring, making management more complex than it should be. Ideally, these protections should be unified and work out of the box rather than requiring extra effort to lock down sensitive data.

7. Lack of Maturity & Changes that Disrupt Workflow

Claim: Fabric is presented as a mature, production-ready analytics platform.

Reality: The good news for Fabric is that it is still evolving. The bad news is, it's still evolving. That evolution impacts users in several ways:  

  • Frequent Updates and Unstable Workflows: Many features remain in preview, and regular updates can sometimes disrupt workflows or introduce unexpected issues. Users have noted that the platform’s UI/UX is continually changing, which can impact consistency in day-to-day operations. Just when you figure out how to do something, the buttons change. 😤
  • Limited Features: Several functionalities are still in preview or implementation is still in progress. For example, dynamic connection information, Key Vault integration for connections, and nested notebooks are not yet fully implemented. This restricts the platform’s applicability in scenarios that rely on these advanced features.
  • Bugs and Stability Issues: A range of known issues—from data pipeline failures to problems with Direct Lake connections—highlights the platform’s instability. These bugs can make Fabric unpredictable for mission-critical tasks. One user lost 3 months of work!

8. Black Box Automation & Limited Customization

Claim: Fabric automates many complex data processes to simplify workflows.

Reality: Fabric is heavy on abstractions, and this can be a double-edged sword. While at first it may appear to simplify things, these abstractions lead to a lack of visibility and control. When things go wrong, it is hard to debug, and it may be difficult to fine-tune performance or optimize costs.

For organizations that need deep visibility into query performance, workload scheduling, or resource allocation, Fabric lacks the granular control offered by competitors like Databricks or Snowflake.

9. Limited Resource Governance and Alerting

Claim: Fabric offers comprehensive resource governance and robust alerting mechanisms, enabling administrators to effectively manage and troubleshoot performance issues.  

Reality: Fabric currently lacks fine-grained resource governance features, making it challenging for administrators to control resource consumption and mitigate issues like the "noisy neighbor" problem, where one service consumes disproportionate resources, affecting others.

The platform's alerting mechanisms are also underdeveloped. While some basic alerting features exist, they often fail to provide detailed information about which processes or users are causing issues. This can make debugging an absolute nightmare. For example, users have reported challenges in identifying specific reports causing slowdowns due to limited visibility in the capacity metrics app. This lack of detailed alerting makes it difficult for administrators to effectively monitor and troubleshoot performance issues, often necessitating the adoption of third-party tools for more granular governance and alerting capabilities. In other words, not so all-in-one in this case.

10. Missing Features & Gaps in Functionality

Claim: Fabric aims to be an all-in-one platform that covers every aspect of data management.  

Reality: Despite its broad ambitions, key features are missing, such as:

  • Geographical Availability: Fabric's data warehousing does not support multiple geographies, which could be a constraint for global organizations seeking localized data storage and processing.  
  • Garbage Collection: Parquet files that are no longer needed are not automatically removed from storage, potentially leading to inefficient storage utilization.  

While these are just a couple of examples, it's important to note that missing features will compel users to seek third-party tools to fill the gaps, introducing additional complexity. Integrating external solutions is not always straightforward with Microsoft products and often introduces a lot of overhead. Alternatively, users will have to go without the features and create workarounds or add more tools, which we know will lead to issues down the road.

Conclusion

Microsoft Fabric promises a lot, but its current execution falls short. Instead of an innovative new platform, Fabric repackages existing services, often making things more complex rather than simpler.

That’s not to say Fabric won’t improve—Microsoft has the resources to refine the platform. But as of 2025, the downsides outweigh the benefits for many organizations.

If your company values flexibility, cost control, and seamless third-party integrations, Fabric may not be the best choice. There are more mature, well-integrated, and cost-effective alternatives that offer the same features without the Microsoft lock-in.

Time will tell if Fabric evolves into the powerhouse it aspires to be. For now, the smart move is to approach it with a healthy dose of skepticism.

👉 Before making a decision, thoroughly evaluate how Fabric fits into your data strategy. Need help assessing your options? Check out this data platform evaluation worksheet.  

The secret to enterprise dbt analytics success
5 mins read

Enterprises are increasingly relying on dbt (Data Build Tool) for their data analytics; however, dbt wasn’t designed to be an enterprise-ready platform on its own. This leads to struggles with scalability, orchestration, governance, and operational efficiency when implementing dbt at scale. But if dbt is so amazing, why is this the case? As our title suggests, you need more than just dbt to have a successful dbt analytics implementation. Keep on reading to learn exactly what you need to supercharge your data analytics with dbt successfully.

Why Enterprises Adopt dbt for Data Transformation

dbt is popular because it solves problems facing the data analytics world. Enterprises today are dealing with growing volumes of data, making efficient data transformation a critical part of their analytics strategy. Traditionally, data transformation was handled using complex ETL (Extract, Transform, Load) processes, where data engineers wrote custom scripts to clean, structure, and prepare data before loading it into a warehouse. However, this approach has several challenges:

  • Slow Development Cycles – ETL processes often required significant engineering effort, creating bottlenecks and slowing down analytics workflows.
  • High Dependency on Engineers – Analysts and business users had to rely on data engineers to implement transformations, limiting agility.
  • Difficult Collaboration & Maintenance – Custom scripts and siloed processes made it hard to track changes, ensure consistency, and maintain documentation.
Issues without dbt

dbt (Data Build Tool) transforms this paradigm by enabling SQL-based, modular, and version-controlled transformations directly inside the data warehouse. By following the ELT (Extract, Load, Transform) approach, dbt allows raw data to be loaded into the warehouse first, then transformed within the warehouse itself—leveraging the scalability and processing power of modern cloud data platforms.

Unlike traditional ETL tools, dbt applies software engineering best practices to SQL-based transformations, making it easier to develop, test, document, and scale data pipelines. This shift has made dbt a preferred solution for enterprises looking to empower analysts, improve collaboration, and create maintainable data workflows.

Key Benefits of dbt

  • SQL-Based Transformations – dbt enables data teams to perform transformations within the data warehouse using standard SQL. By managing the Data Manipulation Language (DML) statements, dbt allows anyone with SQL skills to contribute to data modeling, making it more accessible to analysts and reducing reliance on specialized engineering resources.
  • Automated Testing & Documentation – With more people contributing to data modeling, things can become a mess, but dbt shines by incorporating automated testing and documentation to ensure data reliability. With dbt, teams can have a decentralized development pattern while maintaining centralized governance.
  • Version Control & Collaboration – Borrowing from software engineering best practices, dbt enables teams to track changes using Git. Any changes made to data models can be clearly tracked and reverted, simplifying collaboration.
  • Modular and Reusable Code – dbt's powerful combination of SQL and Jinja enables the creation of modular and reusable code, significantly enhancing maintainability. Using Jinja, dbt allows users to define macros—reusable code snippets that encapsulate complex logic. This means fewer redundancies and consistent application of business rules across models (see the sketch after this list).
  • Scalability & Performance Optimization – dbt leverages the data warehouse’s native processing power, enabling incremental models that minimize recomputation and improve efficiency.
  • Extensibility & Ecosystem – dbt integrates with orchestration tools (e.g., Airflow) and metadata platforms (e.g., DataHub), supporting a growing ecosystem of plugins and APIs.
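To make the "modular and reusable" point above concrete, here is a minimal sketch of a macro and a model that uses it. The macro, model, and column names are hypothetical examples, not taken from any specific project:

-- macros/cents_to_dollars.sql: a reusable business rule defined once
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)
{% endmacro %}

-- models/marts/fct_orders.sql: a model that reuses the macro and another model
select
    order_id,
    customer_id,
    {{ cents_to_dollars('amount_cents') }} as amount_usd,
    ordered_at
from {{ ref('stg_orders') }}

Because the conversion logic lives in one macro, updating that single file updates every model that calls it.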

With these benefits it is clear why over 40,000 companies are leveraging dbt today!

The Challenges of Scaling dbt in the Enterprise

Despite dbt’s strengths, enterprises face several challenges when implementing it at scale for a variety of reasons:

Complexity of Scaling dbt

Running dbt in production requires robust orchestration beyond simple scheduled jobs. dbt only manages transformations, but a complete end-to-end pipeline includes extracting, loading, and visualizing data. To manage the full end-to-end data pipeline (ELT + Viz), organizations need a full-fledged orchestrator like Airflow. While there are other orchestration options on the market, Airflow and dbt are a common pattern.

Lack of Integrated CI/CD & Development Controls

CI/CD pipelines are essential for dbt at the enterprise level, yet one of dbt Core’s major limitations is the lack of a built-in CI/CD pipeline for managing deployments. This makes workflows more complex and increases the likelihood of errors reaching production. To address this, teams can implement external tools like Jenkins, GitHub Actions, or GitLab Workflows that provide a flexible and customizable CI/CD process to automate deployments and enforce best practices.

While dbt Cloud does offer an out-of-the-box CI/CD solution, it lacks customization options. Some organizations find that their use cases demand greater flexibility, requiring them to build their own CI/CD processes instead.
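As a rough illustration of the external CI pipelines mentioned above, here is a minimal GitHub Actions sketch that builds and tests a dbt project on pull requests. The adapter (dbt-snowflake), the "ci" target, and the secret name are assumptions you would adapt to your own project, and it assumes profiles.yml is committed at the repo root:

name: dbt-ci
on:
  pull_request:
    branches: [main]

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-core dbt-snowflake
      - name: Build and test the project
        env:
          DBT_PROFILES_DIR: .                                   # assumes profiles.yml lives at the repo root
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}  # hypothetical secret read by profiles.yml
        run: |
          dbt deps
          dbt build --target ci

Teams typically extend a pipeline like this with Slim CI flags such as --select state:modified+ and --defer so pull requests only build what changed.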

Infrastructure & Deployment Constraints

Enterprises seek alternative solutions that provide greater control, scalability, and security over their data platform. However, this comes with the responsibility of managing their own infrastructure, which introduces significant operational overhead ($$$). Solutions like dbt Cloud do not offer Virtual Private Cloud (VPC) deployment, full CI/CD flexibility, or a fully fledged orchestrator, leaving organizations to handle additional platform components.

We saw a need for a middle ground that combined the best of both worlds: something as flexible as dbt Core and Airflow, but fully managed like dbt Cloud. This led to Datacoves, which provides a seamless experience with no platform maintenance overhead or onboarding hassles. Teams can focus on generating insights from data and not worry about the platform.

Avoiding Vendor Lock-In

Vendor lock-in is a major concern for organizations that want to maintain flexibility and avoid being tied to a single provider. The ability to switch out tools easily without excessive cost or effort is a key advantage of the modern data stack. Enterprises benefit from mixing and matching best-in-class solutions that meet their specific needs.

How Datacoves Solves Enterprise dbt Challenges

Datacoves is a fully managed enterprise platform for dbt, solving the challenges outlined above. Below is how Datacoves' features align with enterprise needs:

Platform Capabilities

  • Integrated Development Environment (IDE): With in-browser VS Code, users can develop SQL and Python seamlessly within a browser-based VS Code environment. This includes full access to the terminal, Python libraries, and VS Code extensions for the most customizable development experience.
VS Code in Datacoves
  • Managed Development Environment: Pre-configured VS Code, dbt, and Airflow setup for enterprise teams. Everything is managed so project leads don’t have to worry about dependencies, Docker images, upgrades, or onboarding. Datacoves users can be onboarded to a new project in minutes, not days.
  • Scalability & Flexibility: Kubernetes-powered infrastructure for elastic scaling. Users don’t have the operational overhead of managing their dbt and Airflow environments; they simply log in and everything just works.
  • Version Control & Collaboration: Datacoves integrates seamlessly with Git services like GitHub, GitLab, Bitbucket, and Azure DevOps. When deployed in the customer’s VPC, Datacoves can even access private Git servers and Docker registries.
  • Security & User Management: Datacoves can integrate Single Sign-On (SSO) for authentication and AD groups for role management.
  • Use of Open-Source Tools: Built on standard dbt Core, Airflow, and VS Code to ensure maximum flexibility. At the end of the day it is your code and you can take it with you.  

Data Extraction and Loading

  • With Datacoves, companies can leverage a managed Airbyte instance out of the box. However, if users are not using Airbyte or need additional EL tools, Datacoves seamlessly integrates with enterprise EL solutions such as AWS Glue, Azure Data Factory, Databricks, StreamSets, etc. Additionally, since Datacoves supports Python development, organizations can leverage their custom Python frameworks or develop using tools like dlt (data load tool) with ease.
Airbyte in Datacoves

Data Transformation

  • Support for SQL & Python: In addition to SQL or Python modeling via dbt, users can develop non-dbt Python scripts right within VS Code.
  • Data Warehouse & Data Lake Support: As a platform, Datacoves is warehouse agnostic. It works with Snowflake, BigQuery, Redshift, Databricks, MS Fabric, and any other dbt-compatible warehouse.  

Pipeline Orchestration

  • Enterprise-Grade Managed Apache Airflow: By adopting a fully fledged orchestrator, developers can orchestrate the full ELT + Viz pipeline, minimizing cost and pipeline failures. One of the biggest benefits of Datacoves is its fully managed Airflow scheduler for data pipeline orchestration. Developers don’t have to worry about the infrastructure overhead or scaling headaches of managing their own Airflow.
Airflow in Datacoves
  • Developer Instance of Airflow ("My Airflow"): With a few clicks, easily stand up a solo sandbox Airflow instance for testing DAGs before deployment. My Airflow can speed up DAG development by 20%+!
  • Orchestrator Flexibility & Extensibility: Datacoves provides templated accelerators for creating Airflow DAGs and managing dbt runs. These best practices can be invaluable to an organization getting started or looking to optimize.
  • Alerting & Monitoring: Out of the box SMTP integration as well as support for custom SMTP, Slack, and Microsoft Teams notifications for proactive monitoring.  

Data Quality and Governance

  • Cross-project lineage via Datacoves Mesh (aka dbt Mesh): Have a large dbt project that would benefit from being split into multiple projects? Datacoves enables large-scale cross-team collaboration with cross-dbt-project support.
  • Enterprise-Grade Data Catalog (DataHub): Datacoves provides an optionally hosted DataHub instance with column-level lineage for tracking data transformations, including cross-project column-level lineage support.
  • CI/CD Accelerators: Need a robust CI/CD pipeline? Datacoves provides accelerator scripts for Jenkins, GitHub Actions, and GitLab workflows so teams don’t start at square one. These scripts are fully customizable to meet any team’s needs.
  • Enterprise-Ready RBAC: Datacoves provides tools and processes that simplify Snowflake permissions while maintaining the controls necessary for securing PII data and complying with GDPR and CCPA regulations.

Licensing and Pricing Plans

Datacoves offers flexible deployment and pricing options to accommodate various enterprise needs:

  • Deployment Options: Choose between Datacoves' multi-tenant SaaS platform or a customer-hosted Virtual Private Cloud (VPC) deployment, ensuring compliance with security and regulatory requirements.  
  • Scalable Pricing: Pricing structures are designed to scale to enterprise levels, optimizing costs as your data operations grow.
  • Total Cost of Ownership (TCO): By providing a fully managed environment for dbt and Airflow, Datacoves reduces the need for in-house infrastructure management, lowering TCO by up to 50%.  

Vendor Information and Support

Datacoves is committed to delivering enterprise-grade support and resources through our white-glove service:

  • Dedicated Support: Comprehensive support packages, providing direct access to Datacoves' development team for timely assistance via Teams, Slack, and/or email.
  • Documentation and Training: Extensive documentation and optional training packages to help teams effectively utilize the platform.  
  • Change Management Expertise: We know that true adoption does not lie with the tools but rather change management. As a thought leader on the subject, Datacoves has guided many organizations through the implementation and scaling of dbt, ensuring a smooth transition and adoption of best practices.  

Conclusion

Enterprises need more than just dbt to achieve scalable and efficient analytics. While dbt is a powerful tool for data transformation, it lacks the infrastructure, governance, and orchestration capabilities required for enterprise-level deployments. Datacoves fills these gaps by providing a fully managed environment that integrates dbt Core, VS Code, Airflow, and Kubernetes-based deployments, making it the ultimate solution for organizations looking to scale dbt successfully.

What's new in dbt 1.9
5 mins read

The latest release, dbt 1.9, introduces some exciting features and updates meant to enhance functionality and tackle some pain points of dbt. With improvements like the microbatch incremental strategy, snapshot enhancements, Iceberg table format support, and streamlined CI workflows, dbt 1.9 continues to help data teams work smarter, faster, and with greater precision. All the more reason to start using dbt today!

We looked through the release notes so you don’t have to. This article highlights the key updates in dbt 1.9, giving you the insights needed to upgrade confidently and unlock new possibilities for your data workflows. If you need a flexible dbt and Airflow experience, Datacoves might be right for your organization. Lower total cost of ownership by 50% and shorten your time to market today!

Compatibility Note: Upgrading from Older Versions

If you are upgrading from dbt 1.7 or earlier, you will need to install both dbt-core and the appropriate adapter. This requirement stems from the decoupling introduced in dbt 1.8, a change that enhances modularity and flexibility in dbt’s architecture. These updates demonstrate dbt’s commitment to providing a streamlined and adaptable experience for its users while ensuring compatibility with modern tools and workflows.

pip install dbt-core dbt-snowflake

Microbatch Incremental Strategy: A Better Way to Handle Large Data

In dbt 1.9, the microbatch incremental strategy is a new way to process massive datasets. In earlier versions of dbt, incremental materialization was available to process datasets which were too large to drop and recreate at every build. However, it struggled to efficiently manage very large datasets that are too large to fit into one query. This limitation led to timeouts and complex query management.

The microbatch incremental strategy comes to the rescue by breaking large datasets into smaller chunks for processing using the batch_size, event_time, and lookback configurations to automatically generate the necessary filters for you. However, at the time of this publication this feature is only available on the following adapters: Postgres, Redshift, Snowflake, BigQuery, Spark, and Databricks, with more on the way.  

Key Benefits of Microbatching

  • Simplified Query Design: As mentioned earlier, dbt will handle the logic for your batch data using simple yet powerful configurations. By setting the event_time, lookback, and batch_size configurations, dbt will generate the necessary filters for each batch. One less thing to worry about!
  • Independent Batch Processing: dbt automatically splits your data into smaller chunks based on the batch_size you set. Each batch is processed separately and in parallel, unless you disable this behavior with the +concurrent_batches config. This independence in batch processing improves performance, minimizes the risk of query failures, allows you to retry failed batches using the dbt retry command, and provides the granularity to load specific batches. Gotta love the control without the extra legwork! (A minimal configuration sketch follows this list.)
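For illustration, here is a minimal sketch of what a microbatch model configuration might look like. The model and column names are hypothetical, and begin (the start date for the initial build) is shown alongside the configurations discussed above:

-- models/fct_page_views.sql (hypothetical model)
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='viewed_at',   -- column dbt uses to filter each batch
        begin='2024-01-01',       -- earliest date to process on the initial build
        batch_size='day',         -- one batch per day
        lookback=3                -- reprocess the last 3 batches to catch late-arriving data
    )
}}

select * from {{ ref('stg_page_views') }}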

Compatibility Note:  Custom microbatch macros

To take advantage of the microbatch incremental strategy, first upgrade to dbt 1.9 and ensure your project is configured correctly. By default, dbt will handle the microbatch logic for you, as explained above. However, if you’re using custom logic, such as a custom microbatch macro, don’t forget to set the require_batched_execution_for_custom_microbatch_strategy behavior flag to True in your dbt_project.yml file. This prevents deprecation warnings and ensures dbt knows how to handle your custom configuration.
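For example, in dbt_project.yml:

flags:
  require_batched_execution_for_custom_microbatch_strategy: True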

If you have a custom microbatch macro but wish to migrate, it's important to note that earlier versions required setting the environment variable DBT_EXPERIMENTAL_MICROBATCH to enable microbatching, but this is no longer needed. Starting with Core 1.9, the microbatch strategy works seamlessly out of the box, so you can remove it.

Enhanced Snapshots: Smarter and More Flexible Data Tracking

With dbt 1.9, snapshots have become easier to use than ever! This is great news for dbt users, since snapshots in dbt allow you to capture the state of your data at specific points in time, helping you track historical changes and maintain a clear picture of how your data evolves. Below are several improvements to implement or be aware of.

Key Improvements in Snapshots

  • YAML Configurations: Snapshots can now be defined directly in YAML files. This makes them easier to manage, read, and update, allowing for a more streamlined configuration process that aligns with other dbt project components. Lots of things are easier in YAML. 😉 (See the sketch after this list.)
  • Customizable Metadata Fields: With the snapshot_meta_column_names config you now have the option to rename metadata fields to match your project's naming conventions. This added flexibility helps ensure consistency across your data models and simplifies collaboration within teams.  
  • Default target_schema: If you do not specify a schema for your snapshots, dbt will use the schema defined for the current environment. This means that snapshots will be created in the default schema associated with your dbt environment settings.
  • Standardization of resource type: Snapshots now support the standard schema and database configurations, similar to models and seeds. This standardization allows you to define where your snapshots are stored using familiar configuration patterns.
  • New Warnings: You will now get a warning if you set an incorrect updated_at data type. This ensures it is an accepted data type or timestamp. No more silent errors.
  • Set an expiration date: Before dbt 1.9, dbt_valid_to was always set to NULL for current records, but you can now configure it to a date with the dbt_valid_to_current config. It is important to note that dbt will not automatically adjust the value in the existing dbt_valid_to column. Meaning, any existing current records will still have dbt_valid_to set to NULL, and new records will have this value set to your configured date. You will have to manually update existing data to match. Fewer NULL values to handle downstream!
  • dbt snapshot --empty: In dbt 1.9, the --empty flag is now supported for the dbt snapshot command, allowing you to execute snapshot operations without processing data. This enhancement is particularly useful in Continuous Integration (CI) environments, enabling the execution of unit tests for models downstream of snapshots without requiring actual data processing, streamlining the testing process. The --empty flag, introduced in dbt 1.8, also has some powerful applications in Slim CI to optimize your CI/CD, which is worth checking out.
  • Improved Handling of Deleted Records: In dbt 1.9, the hard_deletes configuration enhances the management of deleted records in snapshots. This feature offers three methods: the default ignore, which takes no action on deleted records; invalidate, replacing the invalidate_hard_deletes=true config, which marks deleted records as invalid by setting their dbt_valid_to timestamp to the current time; and lastly new_record, which tracks deletions by inserting a new record with a dbt_is_deleted column set to True.
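To tie several of these improvements together, here is a minimal sketch of a YAML-defined snapshot using the new configs. The snapshot name, source relation, and column names are hypothetical:

snapshots:
  - name: orders_snapshot
    relation: source('shop', 'orders')
    config:
      schema: snapshots
      unique_key: order_id
      strategy: timestamp
      updated_at: updated_at
      dbt_valid_to_current: "to_date('9999-12-31')"
      hard_deletes: new_record
      snapshot_meta_column_names:
        dbt_valid_from: valid_from
        dbt_valid_to: valid_to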

Compatibility Note:  hard_deletes

It's important to note some migration efforts will be required for this. While the invalidate_hard_deletes configuration is still supported for existing snapshots, it cannot be used alongside hard_deletes. For new snapshots, it's recommended to use hard_deletes instead of the legacy invalidate_hard_deletes. If you switch an existing snapshot to use hard_deletes without migrating your data, you may encounter inconsistent or incorrect results, such as a mix of old and new data formats. Keep this in mind when implementing these new configs.

Unit Testing Enhancements: Streamlined Testing for Better Data Quality

Testing is a vital part of maintaining high data quality and ensuring your data models work as intended. Unit testing was introduced in dbt 1.8 and has seen continued improvement in dbt 1.9.  

Key Enhancements in Unit Testing:

  • Selective Testing with Unit Test Selectors: dbt 1.9 introduces a new selection method for unit tests, allowing users to target specific unit tests directly using the unit_test: selector. This feature enables more granular control over test execution, allowing you to focus on particular tests without running the entire suite, thereby saving time and resources. (A minimal unit test definition is sketched after this list.)
dbt test --select unit_test:my_project.my_unit_test 

dbt build --select unit_test:my_project.my_unit_test 
  • Improved Resource Type Handling: The update ensures that commands like dbt list --resource-type test now correctly include only data tests, excluding unit tests. This distinction enhances clarity and precision when managing different test types within your project.  
dbt ls --select unit_test:my_project.my_unit_test 
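For reference, a unit test like my_unit_test is defined in YAML (the format introduced in dbt 1.8). The model and column names below are hypothetical and assume a fct_orders model that converts amount_cents to amount_usd:

unit_tests:
  - name: my_unit_test
    model: fct_orders
    given:
      - input: ref('stg_orders')
        rows:
          - {order_id: 1, amount_cents: 100}
    expect:
      rows:
        - {order_id: 1, amount_usd: 1.0}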

Slim CI State Modifications: Smarter and More Accurate Workflows

In dbt version 1.9, the state:modified selector has been enhanced to improve the accuracy of Slim CI workflows. Previously, dynamic configurations—such as setting the database based on the environment—could lead to dbt perceiving changes in models, even when the actual model remained unchanged. This misinterpretation caused Slim CI to rebuild all models unnecessarily, resulting in false positives.

dbt dynamic configurations

By comparing unrendered configuration values, dbt now accurately detects genuine modifications, eliminating false positives during state comparisons. This improvement ensures that only truly modified models are selected for rebuilding, streamlining your CI processes.

Key Benefits:

  • Improved Accuracy: Focusing on unrendered configurations reduces false positives during state comparisons.
  • Streamlined CI Processes: Enhanced change detection allows CI workflows to concentrate solely on resources that require updates or testing.
  • Time and Resource Efficiency: Minimizing unnecessary computations conserves both time and computational resources.

To enable this feature, set the state_modified_compare_more_unrendered_values flag to True in your dbt_project.yml file:

flags: 
	state_modified_compare_more_unrendered_values: True 

Enhanced Documentation Hosting with --host Flag in dbt 1.9

In dbt 1.9, the dbt docs serve command now has more customization abilities with a new --host flag. This flag allows users to specify the host address for serving documentation. Previously, dbt docs serve defaulted to binding the server to 127.0.0.1 (localhost) without an option to override this setting.  

Users can now specify a custom host address using the --host flag when running dbt docs serve. This enhancement provides the flexibility to bind the documentation server to any desired address, accommodating various deployment needs. The --host flag will continue to default to 127.0.0.1, ensuring backward compatibility and secure defaults.
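For example, to expose the docs site beyond localhost (the port shown is simply the default, included for illustration):

dbt docs serve --host 0.0.0.0 --port 8080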

Key Benefits:

  • Deployment Flexibility: Users can bind the documentation server to different host addresses as required by their deployment environment.
  • Improved Accessibility: Facilitates access to dbt documentation across various network configurations by enabling custom host bindings.
  • Enhanced Compatibility: Addresses previous limitations and resolves issues encountered in deployments that require non-default host bindings.

Other Notable Improvements in dbt 1.9

dbt 1.9 includes several updates aimed at improving performance, usability, and compatibility across projects. These changes ensure a smoother experience for users while keeping dbt aligned with modern standards.

  • Iceberg table support: With dbt 1.9, you can now add Iceberg table format support to table, incremental, and dynamic table materializations.
  • Optimized dbt clone Performance: The dbt clone command now executes clone operations concurrently, enhancing efficiency and reducing execution time.
  • Parseable JSON and Text Output in Quiet Mode: The dbt show and dbt compile commands now support parseable JSON and text outputs when run in quiet mode, facilitating easier integration with other tools and scripts by providing machine-readable outputs.
  • skip_nodes_if_on_run_start_fails Behavior Change Flag: A new behavior change flag, skip_nodes_if_on_run_start_fails, has been introduced to gracefully handle failures in on-run-start hooks. When enabled, if an on-run-start hook fails, subsequent hooks and nodes are skipped, preventing partial or inconsistent runs (see the example after this list).
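Like the other behavior flags covered in this article, the flag in the last bullet is enabled in dbt_project.yml, for example:

flags:
  skip_nodes_if_on_run_start_fails: True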

Compatibility Note:  Sans Python 3.8

  • Python 3.8 Support Removed: dbt 1.9 no longer supports Python 3.8, encouraging users to upgrade to newer Python versions. This ensures compatibility with the latest features and enhances overall performance.  

Conclusion

dbt 1.9 introduces a range of powerful features and enhancements, reaffirming its role as a cornerstone tool for modern data transformations.  The enhancements in this release reflect the community's commitment to innovation and excellence as well as its strength and vitality. There's no better time to join this dynamic ecosystem and elevate your data workflows!

If you're looking to implement dbt efficiently, consider partnering with Datacoves. We can help you reduce your total cost of ownership by 50% and accelerate your time to market. Book a call with us today to discover how we can help your organization in building a modern data stack with minimal technical debt.

Check out the full release notes.

dbt and Airflow
5 mins read

dbt and Airflow are cornerstone tools in the modern data stack, each excelling in different areas of data workflows. Together, dbt and Airflow provide the flexibility and scalability needed to handle complex, end-to-end workflows.

This article delves into what dbt and Airflow are, why they work so well together, and the challenges teams face when managing them independently. It also explores how Datacoves offers a fully managed solution that simplifies operations, allowing organizations to focus on delivering actionable insights rather than managing infrastructure.

What is dbt?

dbt (Data Build Tool) is an open-source analytics engineering framework that transforms raw data into analysis-ready datasets using SQL. It enables teams to write modular, version-controlled workflows that are easy to test and document, bridging the gap between analysts and engineers.

  • Adoption: With over 40,000 companies using dbt, the majority rely on open-source dbt Core available to anyone.
  • Key Strength: dbt empowers anyone with SQL knowledge to own the logic behind data transformations, giving them control over cleansing data and delivering actionable insights.
  • Key Weakness: Teams using open-source dbt on their own must manage infrastructure, developer environments, job scheduling, documentation hosting, and the integration of tools for loading data into their data warehouse.  

What is Airflow?

Apache Airflow is an open-source platform designed to orchestrate workflows and automate tasks. Initially created for ETL processes, it has evolved into a versatile solution for managing any sequence of tasks in data engineering, machine learning, or beyond.

  • Adoption: With over 37,000 stars on GitHub, Airflow is one of the most popular orchestration tools, seeing thousands of downloads every month.
  • Key strength: Airflow excels at handling diverse workflows. Organizations use it to orchestrate tools like Azure Data Factory, AWS Glue, and open-source options like dlt (data load tool). Airflow can trigger dbt transformations, post-transformation processes like refreshing dashboards, or even marketing automation tasks. Its versatility extends to orchestrating AI and ML pipelines, making it a go-to solution for modern data stacks.
  • Key weakness: Scaling Airflow often requires running it on Kubernetes to take advantage of its elasticity. However, this introduces significant operational overhead and a steep learning curve to configure and maintain the Kubernetes cluster.

Why dbt and Airflow are a natural pair

Stitch together disjointed schedules

While dbt excels at SQL-based data transformations, it has no built-in scheduler, and solutions like dbt Cloud’s scheduling capabilities are limited to triggering jobs in isolation or getting a trigger from an external source. This approach risks running transformations on stale or incomplete data if upstream processes fail. Airflow eliminates this risk by orchestrating tasks across the entire pipeline, ensuring transformations occur at the right time as part of a cohesive, integrated workflow.

Tools like Airbyte and Fivetran also provide built-in schedulers, but these are designed for loading data at a given time and optionally triggering a dbt pipeline. As complexity grows and organizations need to trigger dbt pipelines after data loads via different means, such as dlt and Fivetran, this simple approach does not scale. It is also common to trigger operations after a dbt pipeline, and scheduling with the data loading tool will not handle that complexity. With dbt and Airflow, a team can connect the entire process and ensure that processes don’t run if upstream tasks fail or are delayed.

Airflow centralizes orchestration, automating the timing and dependencies of tasks—extracting and loading data, running dbt transformations, and delivering outputs. This connected approach reduces inefficiencies and ensures workflows run smoothly with minimal manual intervention.
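As a minimal sketch of this pattern (assuming Airflow 2.4 or newer with the BashOperator, a hypothetical loader script, and a dbt project available on the worker), an end-to-end DAG might look like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",  # run every day at 06:00
    catchup=False,
) as dag:
    # Extract and load raw data (hypothetical script; could be Airbyte, Fivetran, dlt, etc.)
    extract_load = BashOperator(
        task_id="extract_load",
        bash_command="python /opt/pipelines/load_raw_data.py",
    )

    # Transform in the warehouse with dbt, only after loading succeeds
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/analytics/dbt_project && dbt build",
    )

    # Refresh downstream dashboards once transformations finish (hypothetical script)
    refresh_dashboards = BashOperator(
        task_id="refresh_dashboards",
        bash_command="python /opt/pipelines/refresh_dashboards.py",
    )

    extract_load >> dbt_build >> refresh_dashboards

Because dbt_build depends on extract_load, the transformation never runs on data that failed to land, which is exactly the guarantee described above.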

Handle complexity with ease

Modern data workflows extend beyond SQL transformations. Airflow complements dbt by supporting complex, multi-stage processes such as integrating APIs, executing Python scripts, and training machine learning models. This flexibility allows pipelines to adapt as organizational needs evolve.

Airflow also provides a centralized view of pipeline health, offering data teams complete visibility. With its ability to trace issues and manage dependencies, Airflow helps prevent cascading failures and keeps workflows reliable.

By combining dbt’s transformation strengths with Airflow’s orchestration capabilities, teams can move past fragmented processes. Together, these tools enable scalable, efficient analytics workflows, helping organizations focus on delivering actionable insights without being bogged down by operational hurdles.

Managed Airflow and managed dbt in Datacoves

In our previous article, we discussed building vs. buying your Airflow and dbt infrastructure. There are many cons associated with self-hosting these two tools, but Datacoves takes the complexity out of managing dbt and Airflow by offering a fully integrated, managed solution. Datacoves has given many organizations the flexibility of open-source tools with the convenience of managed tools. See how we helped Johnson and Johnson MedTech migrate to our managed dbt and Airflow platform.

Managed dbt

Datacoves offers the most flexible and robust managed dbt Core environment on the market, enabling teams to fully harness the power of dbt without the complexities of infrastructure management, environment setup, or upgrades. Here’s why our customers choose Datacoves to implement dbt:

  • Seamless VS Code environment: Users can log in to a secure, browser-based VS Code development environment and start working immediately. With access to the terminal, VS Code extensions, packages, and libraries, developers have the full power of the tools they already know and love—without the hassle of managing local setups. Unlike inflexible custom IDEs, the familiar and flexible VS Code environment empowers developers to work efficiently. Scaling and onboarding new analytics engineers is streamlined so they can be productive in minutes.
Real-time SQL linting
  • Optimized for dbt development: Datacoves is designed to enhance the dbt development experience with features like SQL formatting, autocomplete, linting, compiled dbt preview, curated extensions, and Python libraries. It ensures teams can develop efficiently and maintain high standards for their project.
  • Effortless upgrade management: Datacoves manages both platform and version upgrades. Upgrades require minimal work from data teams and are usually as simple as “change this line in this file.”
  • CI/CD accelerators: Many teams turn to Datacoves after outgrowing the basic capabilities of dbt Cloud CI. Datacoves integrates seamlessly with leading CI/CD tools like GitHub Actions, GitLab Workflows, and Jenkins. But we don’t stop at providing the tools—we understand that setting up and optimizing these pipelines requires expertise. That’s why we work closely with our customers to implement robust CI/CD pipelines, saving them valuable time and reducing costs.
  • dbt best practices and guidance: Datacoves provides accelerators and starting points for dbt projects, offering teams a strong foundation to begin their work or improve their project following best practices. This approach has helped teams minimize technical debt and ensure long-term project success. As an active and engaged member of the dbt community, Datacoves stays up to date on new improvements and changes, supporting customers by providing expert guidance on required updates and optimizations.

Managed Airflow

Datacoves offers a fully managed Airflow environment, designed for scalability, reliability, and simplicity. Whether you're orchestrating complex ETL workflows, triggering dbt transformations, or integrating with third-party APIs, Datacoves takes care of the heavy lifting by managing the Kubernetes infrastructure, monitoring, and scaling. Here’s what sets Datacoves apart as a managed Airflow solution:

  • Multiple Airflow environments: Teams can seamlessly access their hosted Airflow UI and easily set up dedicated development and production instances. Managing secrets is simplified with secure options like Datacoves Secrets Manager or AWS Secrets Manager, enabling a streamlined and secure workflow without the logistical headaches.
Airflow in Datacoves
  • Observability: With built-in tools like Grafana, teams gain comprehensive visibility into their Airflow jobs and workflows. Monitor performance, identify bottlenecks, and troubleshoot issues with ease—all without the operational overhead of managing Kubernetes clusters or runtime errors.
  • Upgrade management: Upgrading Airflow is simple and seamless. Teams can transition to newer Airflow versions without downtime or complexity.
  • Git sync/S3 sync: Users can effortlessly sync their Airflow DAGs using two popular methods—Git Sync or S3 Sync—without needing to worry about the complexities of setup or configuration.
  • My Airflow: Datacoves offers My Airflow, a standalone Airflow instance that lets users instantly test and develop DAGs at the push of a button. This feature provides developers with the freedom to experiment safely without affecting development or production Airflow instances.
Start My Airflow

  • Airflow best practices and guidance: Datacoves provides expert guidance on DAG optimization and Airflow best practices, ensuring your organization avoids costly technical debt and gets it right from the start.

Conclusion

dbt and Airflow are a natural pair in the Modern Data Stack. dbt’s powerful SQL-based transformations enable teams to build clean, reliable datasets, while Airflow orchestrates these transformations within a larger, cohesive pipeline. Their combination allows teams to focus on delivering actionable insights rather than managing disjointed processes or stale data.

However, managing these tools independently can introduce challenges, from infrastructure setup to scaling and ongoing maintenance. That’s where platforms like Datacoves make a difference. For organizations seeking to unlock the full potential of dbt and Airflow without the operational overhead, solutions like Datacoves provide the scalability and efficiency needed to modernize data workflows and accelerate insights.

Book a call today to see how Datacoves can help your organization realize the power of Airflow and dbt.

10 items to consider when choosing a data migration partner
5 mins read

The world of data moves at a lightning-fast pace, and you may be looking to keep up by migrating your data to a modern infrastructure. As you plan your data migration, you’ll quickly see the many moving parts involved, from data compatibility and security to performance optimization. Choosing the right partner is critical—making the wrong choice can lead to data loss or corruption, compliance failures, project delays, hidden costs, and more. At worst, you could end up with a costly new process that fails to gain user adoption! This article provides 10 key factors to consider in a partner to ensure these pitfalls don’t happen to you, guiding you toward a smooth and successful migration. Let’s dive in!

What is data migration?

Data migration is the process of moving data pipelines from one platform to another. This process can include upgrading or replacing legacy platforms, performing critical maintenance, or transitioning to new infrastructure such as a cloud platform. Whether it's moving data to a modern data center or migrating workloads to the cloud, data migration is a pivotal undertaking that demands meticulous planning and execution.

Organizations may embark on this complex journey for many reasons. A common driver is the need to modernize and adopt cutting-edge solutions like cloud platforms such as Snowflake, which offer unparalleled scalability, performance, and the flexibility of ephemeral resources. Data migration may also be necessitated by mergers and acquisitions, where consolidating and standardizing data across multiple systems becomes essential for unified operations. Additionally, organizations might pursue migration to improve security, streamline workflows, or boost analytics capabilities.

Done right, data migration can be transformative, enhancing data usage and enabling organizations to unlock new opportunities for efficiency, deeper insights, and strategic growth.

The complexity of data migration

Migrating data is a complex undertaking with many moving parts that vary based on your current system and the target system. Careful assessment of your current state and your desired future state is a critical step that should never be overlooked in this planning process. Key considerations include data security, optimizing configurations in the new environment, and transitioning existing pipelines seamlessly. Joe Reis and Matt Housley often emphasize that much of data engineering revolves around "plumbing"—the foundational connections and data flows—which must be meticulously managed for any successful migration.

A lift-and-shift approach, where pipelines are simply moved without modifications, should be avoided as much as possible. This method often undermines the purpose of migrating in the first place: to capitalize on modern features and enhancements offered by newer tools, such as dbt, to improve data quality, documentation, and impact analysis. Moving to dbt without re-thinking how data is cleansed and transformed can lead to outcomes that are worse than your current state such as increased compute costs and difficulty in debugging issues.

Given these complexities, detailed planning, skilled execution, prioritization, decommissioning unused assets, and effective risk management are crucial for a successful migration. Achieving this demands experienced professionals who can execute flawlessly while remaining adaptable to unexpected challenges.  

The risks of choosing the wrong partner

As we have seen above, there are many complexities when it comes to data migration, making the selection of the right partner paramount. Choosing the wrong partner can potentially lead to longer implementation times, hidden costs, project failure, compliance failures, data loss and corruption, and lost opportunity costs. Let’s discuss each of these in a little more detail.  

Longer time to implementation

Inexperienced partners can cause significant delays due to suboptimal choices in planning, technology selection, and execution. These inefficiencies can lead to frequent setbacks, resource mismanagement, and potential catastrophic roadblocks. Prolonged implementation timelines may also result in missed opportunities to capture market value and reduce time-to-insight, while eroding trust in a system that has yet to be fully implemented.

Hidden costs

Hiring the wrong partner often results in unforeseen costs due to extended project timelines as mentioned above, poor resource allocation, and the need for rework when initial efforts fall short. These hidden costs may include increased labor expenses, additional technology investments to rectify poor initial solutions, and higher costs associated with resolving data security or compliance issues.  Budget overruns and unexpected expenses from lack of foresight, poor risk management, and inefficiency can quickly erode ROI.

Project failure

A poorly executed data migration can lead to a new process that underperforms, costs more, or fails to gain user adoption. When users reject a poorly implemented system, organizations may be forced to maintain legacy systems, further compounding costs and delaying innovation. Worse still, critical data may be unusable or inconsistent, undermining trust in data-driven initiatives.

Compliance failures

Hiring the right partner is essential for ensuring compliance with data regulations, industry standards, and security best practices. Without expertise in these areas, there is a heightened risk of data breaches, non-compliance fines, and reputational damage due to mishandling sensitive information. Such failures can lead to costly legal ramifications, operational downtime, and diminished customer trust.

Data loss or corruption

Inadequate planning, testing, or execution can result in the loss or corruption of critical data during migration. Poor data management practices, such as insufficient backups, improper mapping of data fields, or inadequate validation procedures, can compromise data integrity and create gaps in your data sets. Data loss and corruption can disrupt business operations, degrade analytics capabilities, and require extensive rework to correct.

Missed optimization opportunities

Choosing the wrong partner can lead to missed opportunities for optimizing data processes, modernizing workflows, and unlocking valuable business insights. Every moment spent fixing issues or addressing inefficiencies due to poor implementation represents lost time that could have been invested in enhancing data quality, streamlining operations, and driving strategic initiatives. This opportunity cost is often overlooked but can be the difference between gaining a competitive edge and falling behind.

10 key factors to consider when choosing a data migration partner

Datacoves does not do data migrations, but we see companies hire partners to do this work as they implement our platform. Through our experience, we have compiled a list of 10 key factors to consider when selecting a data migration partner. Carefully evaluating these factors can significantly increase the likelihood of success for your data migration plan and ensure a smoother overall process.

1. Proven track record of success

When selecting a data migration partner, it’s crucial to thoroughly review their case studies, references, and client testimonials. Focus on case studies that feature companies with similar starting points and objectives to your own. Approach client testimonials with a discerning eye and validate their claims by contacting references. This is an excellent opportunity to determine whether the partner is merely focused on checking tasks off a to-do list or genuinely dedicated to setting things up correctly the first time, with a passion for leaving your organization in a strong position. While this may seem like a considerable effort, such diligence is essential for investing in your data’s success and ensuring the partner can deliver on their promises.

2. Deep technical expertise

Building on the importance of a proven track record from above, this factor emphasizes the need for technical depth. Verify that your potential partner is proficient in overarching data terminology and best practices, with deep familiarity in areas such as data architecture, data modeling, data governance, data integration, and security protocols. A qualified data partner must have the expertise necessary to successfully guide you through every phase of your data migration. Skipping this crucial step can lead to poorly structured data, compromised system performance, and numerous missed opportunities for optimization.

3. Effective project management, communication, and collaboration skills

This is often overlooked when selecting a data migration partner, yet it plays a critical role in ensuring a successful project. When evaluating potential partners, consider asking the following questions to assess their project management and communication capabilities:

  • How do you structure the migration process?
  • Will you provide regular sprint updates to keep us informed of progress?
  • How transparent are you about the use of billable hours?
  • Do you offer dashboards or tools that keep us updated and provide comprehensive data plans with clear, actionable timelines that we can follow and provide feedback on?
  • How will you collaborate with our team to ensure a seamless workflow and maintain clear, consistent roadmaps?
  • If deviations from the initial plan become necessary, how do you communicate and manage such changes?

This is by no means an exhaustive list of questions but rather a great starting point. The right partner should feel like a leader rather than a liability, demonstrating their expertise in a proactive manner. This ensures you don’t have to constantly direct their work but can trust them to drive the project forward effectively.

4. Industry-specific knowledge

A common theme for a successful partnership is deep expertise, and this is especially true for industry-specific knowledge. Every industry has its unique challenges and pitfalls when it comes to data. It is important to seek out partners who are experts in your industry and have a proven track record of successfully guiding similar organizations to their goals. For example, if your organization operates within the Health and Life Sciences sector, a partner with experience exclusively in Retail may lack the nuanced understanding required for your specific data needs, such as handling PII data, adhering to stringent regulatory compliance, or managing complex clinical trial data. While industry familiarity shouldn’t necessarily be a dealbreaker for every organization, it can be critical for sectors like Health and Life Sciences due to their high regulatory demands. Other industries may find it less restrictive, which is why it remains a key factor to consider when finding the right fit. See how Datacoves helped J&J achieve a 66% reduction in data processing with their Modern Data Platform, best practices, and accelerators.

5. Comprehensive risk mitigation strategy

A partner's ability to minimize downtime, prevent data loss, and mitigate security risks throughout the migration process is essential to avoiding catastrophic consequences such as prolonged system outages, data breaches, or compliance failures. A comprehensive risk mitigation strategy ensures that every aspect of your data migration is thoughtfully planned and executed with contingencies in place. Ask potential partners how they approach risk assessment, what protocols they follow to maintain data integrity, and how they handle unexpected issues. The right partner will proactively identify potential risks and implement measures to address them, providing you with peace of mind during what can be an otherwise complex and challenging process.

6. Flexibility and customization

A successful data migration partner should offer tailored solutions rather than relying on one-size-fits-all approaches. Every organization’s data needs are unique, and flexibility in meeting those needs is extremely important. Consider how a partner adapts their strategy and tools to align with your specific requirements, workflows, and constraints. Do they take the time to understand your goals and develop a plan accordingly, or do they push prepackaged solutions? The ability to customize their approach can be the difference between a migration that delivers optimal business value and one that merely "gets the job done."  

7. Long-term support and optimization capabilities

Data migration doesn’t end with the initial project. A strong partner should offer ongoing support, optimization, and strategic guidance post-migration to ensure continued value from your data infrastructure. Ask about their approach to post-migration support: Will they provide continued monitoring, performance optimization, and assistance—and for how long? The best partners view your success as an ongoing journey, bringing the expertise needed to continuously refine and enhance your data systems. Their commitment to getting things right the first time minimizes future issues and demonstrates a vested interest in your long-term success. By prioritizing a forward-thinking approach, they ensure your data systems are built to last, rather than quickly implemented and forgotten. This is why Datacoves goes beyond just providing tools; we offer accelerators and best practices designed to help you implement dbt successfully, ensuring a strong foundation for your data transformation journey. We work with strategic migration partners that will help you set things up the right way and are around for the long haul.

8. Time zone overlap

For many organizations, the geographic location of a data migration partner can impact communication and project efficiency. Consider whether the partner’s working hours overlap with yours. How will they handle urgent requests or collaboration across different time zones? Effective time zone alignment can enhance communication, reduce delays, and ensure faster resolution of issues. The last thing you want is to find an issue and not be able to get an answer until the next day.  

9. Change management focus

Successful data migration extends beyond the technical execution and tooling—it also requires effective change management. A capable partner will help your organization navigate the changes associated with data migration, including new processes, systems, and ways of working. How do they support employee training, communication, and adoption of new tools? Do they provide resources and strategies to ensure a smooth transition? Partners with a strong change management focus will work with you to minimize disruptions and maximize user adoption.

10. Certification

When evaluating potential partners, keep in mind that while their team lead may be highly technical, the team members you’ll work with day-to-day might not always match that level of expertise. Ensure that the team members working on your project possess relevant certifications for the key technologies you use. Certifications, such as dbt Certification, Snowflake Certification, or other relevant credentials, demonstrate expertise and a commitment to staying current with industry standards and best practices. Ask potential partners to provide proof of certification and inquire about how their team keeps pace with evolving technologies. While certifications alone don’t guarantee proficiency, they offer a solid starting point for assessing skill and commitment. This assurance of expertise can significantly impact the success of your project.

Don’t skimp on cost

Cost should not be the determining factor when hiring a migration partner. It is an essential consideration since it directly impacts the project budget, but you must also weigh the total cost of ownership of your new platform: the quality of the initial migration will shape your ongoing costs for years to come. A low-cost partner will likely lack several of the items listed above, and your migration team may be staffed with inexperienced members. The migration will be done, but how much technical debt will you accumulate along the way?

Avoid simply searching for the lowest-cost vendor. Though this may lower upfront expenses, it often results in higher costs over time due to errors, inefficiencies, and the need for rework. Projects that are rushed or handled without proper expertise tend to exceed their budgets, take longer to complete, and are more challenging to maintain in the long run because they weren’t done correctly or optimized from the start. Experienced partners bring significant value by ensuring work is done right and to a high standard from the beginning. Naturally, contracting a partner that meets most, if not all, of the key factors mentioned above requires a monetary investment. This should be viewed as an investment in expertise that helps mitigate long-term costs and risks.

Conclusion

Choosing the right data migration partner is key to minimizing risks and ensuring optimal outcomes for your organization. The complexities and challenges of data migration demand a partner with proven expertise, industry-specific knowledge, effective communication, flexibility, and a commitment to long-term support. Each of the factors outlined above plays a vital role in determining the success of your migration project—potentially saving your organization from costly delays, hidden expenses, compliance pitfalls, and lost business opportunities.

Carefully evaluate potential partners using these key considerations to ensure you select a partner who will not only meet your immediate data migration needs but also support your organization’s continued success and growth. 📈

Can the right data migration platform cut costs and speed up delivery?

Datacoves has built-in best practices and accelerators built from our deep expertise in dbt, Airflow, and Snowflake. Our platform is designed to simplify your data transformation journey while providing excellent value by reducing your reliance on costly consultants. With our baked-in best practices, our customers have achieved faster implementations, enhanced efficiency, and long-term scalability.

optimize dbt slim ci
5 mins read

Any experienced data engineer will tell you that efficiency and resource optimization are always top priorities. One powerful feature that can significantly optimize your dbt CI/CD workflow is dbt Slim CI. However, despite its benefits, some limitations have persisted. Fortunately, the recent addition of the --empty flag in dbt 1.8 addresses these issues. In this article, we will share a GitHub Action Workflow and demonstrate how the new --empty flag can save you time and resources.

What is dbt Slim CI?

dbt Slim CI is designed to make your continuous integration (CI) process more efficient by running only the models that have been changed and their dependencies, rather than running all models during every CI build. In large projects, this feature can lead to significant savings in both compute resources and time.

Key Benefits of dbt Slim CI

  • Speed Up Your Workflows: Slim CI accelerates your CI/CD pipelines by skipping the full execution of all dbt models. Instead, it focuses only on the modified models and their dependencies, using the --defer flag to pull the unmodified models from production. So, if we have models A, B, and C but only make changes to C, then only model C will be run during the CI/CD process.
  • Save Time, Snowflake Credits, and Money: By running only the necessary models, Slim CI helps you save valuable build time and Snowflake credits. This selective approach means fewer computational resources are used, leading to cost savings.

dbt Slim CI flags explained

dbt Slim CI is implemented efficiently using these flags:

--select state:modified+: The state:modified selector tells dbt to include only the models whose "state" has changed (i.e., models that were modified) compared to a previous run. Adding the + suffix (state:modified+) also includes the downstream dependencies of those modified models in the run/build.

--state <path to production manifest>: The --state flag specifies the directory where the artifacts from a previous dbt run are stored, i.e., the production dbt manifest. By comparing the current branch's manifest with the production manifest, dbt can identify which models have been modified.

--defer: The --defer flag tells dbt to pull upstream models that have not changed from a different environment (database). Why rebuild something that exists somewhere else? For this to work, dbt will need access to the dbt production manifest.

dbt build
dbt CI/CD command
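Putting these flags together, a minimal sketch of the Slim CI build command might look like this (the artifact path is illustrative and depends on where your workflow stores the production manifest):

# Build only modified models and their downstream dependencies,
# deferring unchanged upstream models to production
dbt build \
  --fail-fast \
  --select state:modified+ \
  --defer \
  --state ./prod-run-artifacts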

You may have noticed that there is an additional flag in the command above.  

--fail-fast: The --fail-fast flag is an example of an optimization flag that is not essential to a barebones Slim CI setup but can provide powerful cost savings. This flag stops the build as soon as an error is encountered instead of allowing dbt to continue building downstream models, therefore reducing wasted builds. To learn more about these arguments, have a look at our dbt cheat sheet.

dbt Slim CI with Github Actions before dbt 1.8

The following sample Github Actions workflow is executed when a Pull Request is opened, i.e., you have a feature branch that you want to merge into main.

sample Github Actions workflow is executed when a Pull Request is opened
Same Github Action Prior to dbt 1.8

Workflow Steps

Checkout Branch: The workflow begins by checking out the branch associated with the pull request to ensure that the latest code is being used.

Set Secure Directory: This step ensures the repository directory is marked as safe, preventing potential issues with Git operations.  

List of Files Changed: This command lists the files changed between the PR branch and the base branch, providing context for the changes and helping with debugging.

Install dbt Packages: This step installs all required dbt packages, ensuring the environment is set up correctly for the dbt commands that follow.

Create PR Database: This step creates a dedicated database for the PR, isolating the changes and tests from the production environment.

Get Production Manifest: Retrieves the production manifest file, which will be used for deferred runs and governance checks in the following steps.

Run dbt Build in Slim Mode or Run dbt Build Full Run: If a manifest is present in production, dbt will be run in slim mode with deferred models. This build includes only the modified models and their dependencies. If no manifest is present in production, we will do a full run (see the sketch after these workflow steps).

Grant Access to PR Database: Grants the necessary access to the new PR database for end user review.

Generate Docs Combining Production and Branch Catalog: If a dbt test is added to a YAML file, the model will not be run, meaning it will not be present in the PR database. However, governance checks (dbt-checkpoint) need the model to exist in the database for some checks, and if it is not present, this will cause a failure. To solve this, the generate docs step merges the catalog.json from the current branch with the production catalog.json.

Run Governance Checks: Executes governance checks such as SQLFluff and dbt-checkpoint.
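As a rough sketch, the "Run dbt Build in Slim Mode or Run dbt Build Full Run" step above boils down to shell logic along these lines (the artifact path is illustrative):

# If a production manifest was retrieved, run Slim CI with deferral;
# otherwise build the whole project
if [ -f ./prod-run-artifacts/manifest.json ]; then
  dbt build --fail-fast --select state:modified+ --defer --state ./prod-run-artifacts
else
  dbt build --fail-fast
fi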

Problems with the dbt CI/CD Workflow

As mentioned at the beginning of the article, there is a limitation to this setup. In the existing workflow, governance checks need to run after the dbt build step because dbt-checkpoint relies on the manifest.json and catalog.json. However, if these governance checks fail, the dbt build step will need to run again once the governance issues are fixed. As shown in the diagram below, after running our dbt build, we proceed with governance checks. If these checks fail, we need to resolve the issue and re-trigger the pipeline, leading to another dbt build. This cycle can lead to unnecessary model builds even when leveraging dbt Slim CI.

ci/cd process before dbt 1.8
dbt CI/CD before dbt 1.8

Leveraging the --empty Flag for Efficient dbt CI/CD Workflows

The solution to this problem is the --empty flag in dbt 1.8. This flag allows dbt to perform schema-only dry runs without processing large datasets. It's like building the wooden frame of a house—it sets up the structure, including the metadata needed for governance checks, without filling it with data. The framework is there, but the data itself is left out, enabling you to perform governance checks without completing an actual build.
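In practice, this is just the familiar Slim CI command with --empty appended; assuming the same illustrative artifact path as before:

# Schema-only build: creates the modified models in the PR database
# without loading data, producing the metadata needed for governance checks
dbt build --select state:modified+ --defer --state ./prod-run-artifacts --empty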

Let’s see how we can rework our Github Action:

rework our Github Action
Sample Github Action with dbt 1.8

Workflow Steps

Checkout Branch: The workflow begins by checking out the branch associated with the pull request to ensure that the latest code is being used.

Set Secure Directory: This step ensures the repository directory is marked as safe, preventing potential issues with Git operations.  

List of Files Changed: This step lists the files changed between the PR branch and the base branch, providing context for the changes and helping with debugging.

Install dbt Packages: This step installs all required dbt packages, ensuring the environment is set up correctly for the dbt commands that follow.

Create PR Database: This command creates a dedicated database for the PR, isolating the changes and tests from the production environment.

Get Production Manifest: Retrieves the production manifest file, which will be used for deferred runs and governance checks in the following steps.

*NEW* Governance Run of dbt (Slim or Full) with EMPTY Models: If there is a manifest in production, this step runs dbt in slim mode with the --empty flag; otherwise it performs a full run, also with empty models. The models are built in the PR database with no data inside, and we can now use the generated catalog.json to run our governance checks. Since the models are empty, we get everything we need for the checks while saving on both compute costs and run time (see the sketch after these workflow steps).

Generate Docs Combining Production and Branch Catalog: If a dbt test is added to a YAML file, the model will not be run, meaning it will not be present in the PR database. However, governance checks (dbt-checkpoint) need the model to exist in the database for some checks, and if it is not present, this will cause a failure. To solve this, the generate docs step merges the catalog.json from the current branch with the production catalog.json.

Run Governance Checks: Executes governance checks such as SQLFluff and dbt-checkpoint.

Run dbt Build: Runs dbt build using either slim mode or full run after passing governance checks.

Grant Access to PR Database: Grants the necessary access to the new PR database for end user review.
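Condensed to commands, the reworked ordering might look roughly like this; the SQLFluff and dbt-checkpoint invocations are illustrative and depend on how those tools are configured in your repository:

# 1. Empty (schema-only) build to generate the metadata governance checks need
dbt build --select state:modified+ --defer --state ./prod-run-artifacts --empty
dbt docs generate   # produce catalog.json for the governance checks

# 2. Governance checks run against the generated artifacts
sqlfluff lint models/
pre-commit run --all-files

# 3. Real build, executed only after the checks pass
dbt build --fail-fast --select state:modified+ --defer --state ./prod-run-artifacts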

By leveraging the dbt --empty flag, we can materialize models in the PR database without the computational overhead, since the actual data is left out. We can then use the metadata that was generated during the empty build. If any checks fail, we can repeat the process without worrying about wasting computational resources on an actual build. The cycle still exists, but we have moved our real build outside of this cycle and replaced it with an empty build. Once all governance checks have passed, we can proceed with the real dbt build as seen in the diagram below.

ci/cd process after dbt 1.8
dbt CI/CD after dbt 1.8 --empty flag

Conclusion

dbt Slim CI is a powerful addition to the dbt toolkit, offering significant benefits in terms of speed, resource savings, and early error detection. However, we still faced the issue of wasted model builds when governance checks failed. By incorporating dbt 1.8’s --empty flag into your CI/CD workflows, you can reduce wasted model builds to zero, improving the efficiency and reliability of your data engineering processes.

🔗 Watch the video where Noel explains the --empty flag implementation in Github Actions:

dbt Jinja cheat sheet
5 mins read

Jinja templating in dbt offers flexibility and expressiveness that can significantly improve SQL code organization and reusability. There is a learning curve, but this cheat sheet is designed to be a quick reference for data practitioners, helping to streamline the development process and reduce common pitfalls.

Whether you're troubleshooting a tricky macro or just brushing up on syntax, bookmark this page. Trust us, it will come in handy and help you unlock the full potential of Jinja in your dbt projects.

If you find this cheat sheet useful, be sure to check out our Ultimate dbt Jinja Functions Cheat Sheet. It covers the specialized Jinja functions created by dbt, designed to enhance versatility and expedite workflows.

dbt Jinja: Basic syntax

This is the foundational syntax of Jinja, from how to comment to the difference between statements and expressions.

dbt Jinja: Variable assignment 

Define and assign variables in different data types such as strings, lists, and dictionaries.

dbt Jinja: White space control 

Jinja allows fine-grained control over white spaces in compiled output. Understand how to strategically strip or maintain spaces.


dbt Jinja: Control flow

In dbt, conditional structures guide the flow of transformations. Grasp how to integrate these structures seamlessly.

Control Flow
If/elif/else/endif
{%- if target.name == 'dev' -%}
{{ some code }}
{%- elif target.name == 'prod' -%}
{{ some other code }}
{%- else -%}
{{ some other code }}
{%- endif -%}

dbt Jinja: Looping

Discover how to iterate over lists and dictionaries. Understand the simple loop syntax and how to access loop properties.

Looping
Loop Syntax
{%- for item in my_iterable -%}
  --Do something with item
  {{ item }}
{%- endfor -%}
loop.last
This boolean is False unless the current iteration is the last iteration.
{% for item in list %}
  {% if loop.last %}
    --This is the last item
    {{ item }}
  {% endif %}
{% endfor %}
loop.first
A boolean that is True if the current iteration is the first iteration, otherwise False.
{% for item in list %}
  {% if loop.first %}
    --first item
    {{ item }}
  {% endif %}
{% endfor %}
loop.index
An integer representing the current iteration of the loop (1-indexed). So, the first iteration would have loop.index of 1, the second would be 2, and so on.
{% for item in list %}
   --This is item number
   {{ loop.index }}
{% endfor %}
Looping a List
{% set rating_categories = ["quality_rating",
                            "design_rating",
                            "usability_rating"] %}
SELECT product_id,
 {%- for col_name in rating_categories -%}
   AVG({{ col_name }}) as {{ col_name }}_average
   {%- if not loop.last  -%} 
     , 
   {%- endif -%}
 {%- endfor -%}
 FROM product_reviews
 GROUP BY 1

Compiled code
SELECT product_id,
   AVG(quality_rating) as quality_rating_average,
   AVG(design_rating) as design_rating_average,
   AVG(usability_rating) as usability_rating_average
FROM product_reviews
GROUP BY 1
Looping a Dictionary
{% set delivery_type_dict = {"a": "digital_download",
                             "b": "physical_copy"} %}
SELECT order_id,
{%- for type, column_name in delivery_type_dict.items() -%}
COUNT(CASE 
      WHEN delivery_method = '{{ type }}' THEN order_id 
      END) as {{ column_name }}_count
      {%- if not loop.last  -%}
       , 
      {%- endif -%}
      {%- endfor -%}
FROM order_deliveries
GROUP BY 1

Compiled code
SELECT order_id,
COUNT(CASE 
      WHEN delivery_method = 'a' THEN order_id 
      END) as digital_download_count,
COUNT(CASE 
      WHEN delivery_method = 'b' THEN order_id 
      END) as physical_copy_count
FROM order_deliveries
GROUP BY 1

dbt Jinja: Operators 

These logical and comparison operators come in handy, especially when defining tests or setting up configurations in dbt.

Logic Operators
and
{% if condition1 and condition2 %}
or
{% if condition1 or condition2 %}
not
{{  not condition1 }}

Comparison Operators
Equal To
{% if 1 == 2 %}
Not Equal To
{% if 1 != 2 %}
Greater Than
{% if 1 > 2 %}
Less Than
{% if 1 < 2 %}
Greater Than or Equal to
{% if 1 >= 2 %}
Less Than or Equal To
{% if 1 <= 2 %}

dbt Jinja: Variable tests

Within dbt, you may need to validate if a variable is defined or if a value is odd or even. These Jinja variable tests allow you to validate with ease.

Jinja Variable Tests
Is Defined
{% if my_variable is defined %}
-- Handle conditions when variable exists
{% endif %}
Is None

{% if my_variable is none %}
-- Handle absence of my_variable
{% endif %}
Is Even

{% if my_variable is even %}
-- Handle when my_variable is even
{% endif %}
Is Odd

{% if my_variable is odd %}
-- Handle when my_variable is odd
{% endif %}
Is a String

{% if my_variable is string %}
-- Handle when my_variable is a string
{% endif %}
Is a Number

{% if my_variable is number %}
-- Handle when my_variable is a number
{% endif %}

dbt Jinja: Creating macros & tests

Macros are the backbone of advanced dbt workflows. Review how to craft these reusable code snippets and also how to enforce data quality with tests.

Creating Macros & Tests
Define a Macro
Write your macros in your project's macros directory.
{% macro ms_to_sec(col_name, precision=3) %}   
  ( {{ col_name }} / 1000 )::numeric(16, {{ precision }})   
{% endmacro %}
Use a Macro from a Model
In a model:
SELECT order_id,       
  {{ ms_to_sec(col_name='time_ms', precision=3) }} as time_sec
FROM order_timings;

Compiled code:
SELECT order_id,
(time_ms/ 1000 )::numeric(16, 3) AS time_sec
FROM order_timings;
Run a Macro from the Terminal
Define in your macros directory. Ex) macros/create_schema_macro.sql:
{% macro create_schema(schema_name) %}
    CREATE SCHEMA IF NOT EXISTS {{ schema_name }};
{% endmacro %}

In Terminal:

dbt run-operation create_schema --args '{"schema_name": "my_new_schema"}'
Define a Generic Test
Generic Tests used to be defined in the macros directory. It is now recommended to write your Generic Tests in the tests/generic directory.

{% test over_10000(model, column_name) %}
  SELECT {{column_name}} 
  FROM {{ model }}   
  WHERE {{column_name}} > 10000     
{% endtest %}
Use a Generic test
In models/schema.yml add the generic test to the model and column you wish to test.
version: 2

models:
  - name: my_model
    columns:
      - name: column_to_test
        tests:
          - over_10000
          - not_null
Define a Singular Test
Write your dbt Singular tests in the tests directory and give them a descriptive name. Ex) tests/test_suspicious_refunds.sql
SELECT order_id, 
SUM(CASE
    WHEN amount < 0 THEN amount 
    ELSE 0 
    END) as total_refunded_amount,       
COUNT(CASE 
     WHEN amount < 0 THEN 1 
     END) as number_of_refunds  
FROM {{ ref('my_model') }}  
GROUP BY 1   
HAVING number_of_refunds > 5

dbt Jinja: Filters (aka Methods)

Fine-tune your dbt data models with these transformation and formatting utilities.

String Manipulation
Lower
{{ "DATACOVES" | lower }} => "datacoves"
Upper
{{ "datacoves" | upper }} => "DATACOVES"
Default
{{ variable_name | default("Default Value") }}    
If value exists => "Sample Value"
If value does not exist => "Default Value"
Trim
{{ "Datacoves   " | trim }} => "Datacoves"  
Replace
{{ "Datacoves" | replace("v", "d") }} => "Datacodes" 
Length
{{ "Datacoves" | length }} => 9
Capitalize
{{ "datacoves" | capitalize }} => "Datacoves"  
Title
{{ "datacoves managed platform" | title }}
  => "Datacoves Managed Platform"
Repeat a String
{{ print('-' * 20) }}
Substring
{{ "Datacoves"[0:4] }} => "Data"
Split
{{ "Data coves".split(' ') }} => ["Data", "coves"]  

Number Manipulation
Int
{{ "20" | int }} => 20 
Float
{{ 20 | float }} => 20.0 
Rounding to Nearest Whole Number
{{ 20.1434 | round }} => 20.0
Rounding to a Specified Decimal Place
{{ 20.1434 | round(2) }} => 20.14
Rounding Down (Floor Method)
{{ 20.5 | round(method='floor') }} => 20 
Rounding Up (Ceil Method)
{{ 20.5 | round(method='ceil') }} => 21

Please contact us with any errors or suggestions.

dbt alternatives
5 mins read

dbt (data build tool) is a powerful data transformation tool that allows data analysts and engineers to transform data in their warehouse more effectively. It enables users to write modular SQL queries, which it then runs on top of the data warehouse; this helps to streamline the analytics engineering workflow by leveraging the power of SQL. In addition to this, dbt incorporates principles of software engineering, like modularity, documentation and version control.

dbt Core vs dbt Cloud

Before we jump into the list of dbt alternatives, it is important to distinguish dbt Core from dbt Cloud. The primary difference between dbt Core and dbt Cloud lies in their execution environments and additional features. dbt Core is an open-source package that users can run on their local systems or orchestrate with their own scheduling systems. It is great for developers comfortable with command-line tools and custom environments. dbt Cloud, on the other hand, provides a hosted service with dbt Core as its base. It offers a web-based interface that includes automated job scheduling, an integrated IDE, and collaboration features, making it a simpler option for those less familiar with command-line operations and those with less complex platform requirements.

You may be searching for alternatives to dbt due to preference for simplified platform management, flexibility to handle your organization’s complexity, or other specific enterprise needs. Rest assured because this article explores ten notable alternatives that cater to a variety of data transformation requirements.

We have organized the dbt alternatives covered in this article into 3 groups: dbt Cloud alternatives, code-based dbt alternatives, and GUI-based dbt alternatives.

dbt Cloud Alternatives

dbt Cloud is a tool that dbt Labs provides; there are a few things to consider:

  • Flexibility may be hindered by the inability to extend the dbt Cloud IDE with Python libraries or VS Code extensions
  • Handling the enterprise complexity of an end-to-end ELT pipeline will require a full-fledged orchestration tool
  • Costs can be higher than some of the alternatives below, especially at enterprise scale
  • Data security and compliance may require VPC deployment which is not available in dbt Cloud
  • Some features of dbt Cloud are not open source, increasing vendor lock-in

Although dbt Cloud can help teams get going quickly with dbt, it is important to have a clear understanding of the long-term vision for your data platform and of the total cost of ownership. You may be reading this article because you are still interested in implementing dbt but want to know what your options are other than dbt Cloud.

Datacoves

Datacoves is tailored specifically as a seamless alternative to dbt Cloud. The platform integrates directly with existing cloud data warehouses, provides a user-friendly interface that simplifies the management and orchestration of data transformation workflows with Airflow, and provides a preconfigured VS Code IDE experience. It also offers robust scheduling and automation with managed Airflow, enabling data transformations with dbt to be executed based on specific business requirements.

Benefits:

Flexibility and Customization: Datacoves allows customization such as enabling VS Code extensions or adding any Python library. This flexibility is needed when adapting to dynamic business environments and evolving data strategies, without vendor lock-in.

Handling Enterprise Complexity: Datacoves is equipped with managed Airflow, providing the full-fledged orchestration tool necessary for managing complex end-to-end ELT pipelines. This ensures robust data transformation workflows tailored to specific business requirements. Additionally, Datacoves does not just support the T (transformation) in the ELT pipeline; the platform spans the entire pipeline, helping users tie all the pieces together, from initial data load to post-transformation operations such as pushing data to marketing automation platforms.

Cost Efficiency: Datacoves optimizes data processing and reduces operational costs associated with data management as well as the need for multiple SaaS contracts. Its pricing model is designed to scale efficiently.

Data Security and Compliance: Datacoves is the only commercial managed dbt data platform that supports VPC deployment in addition to SaaS, offering enhanced data security and compliance options. This ensures that sensitive data is handled within a secure environment, adhering to enterprise security standards. A VPC deployment is advantageous for some enterprises because it helps reduce the red tape while still maintaining optimal security.

Open Source and Reduced Vendor Lock-In: Datacoves bundles a range of open-source tools, minimizing the risk of vendor lock-in associated with proprietary features. This approach ensures that organizations have the flexibility to switch tools without being tied to a single vendor.

Do-it-Yourself dbt Core

It is worth mentioning that because dbt Core is open source, a DIY approach is always an option. However, opting for a DIY solution requires careful consideration of several factors. Key among these is assessing team resources, as successful implementation and ongoing maintenance of dbt Core necessitate a certain level of technical expertise. Additionally, time to production is an important factor; setting up a DIY dbt Core environment and adapting it to your organization’s processes can be time-consuming.

Finally, maintainability is essential: ensuring that the dbt setup continues to meet organizational needs over time requires regular updates and adjustments. While a DIY approach with dbt Core can offer customization and control, it demands significant commitment and resources, which may not be feasible for all organizations.

Benefits:

This is a very flexible approach because it will be built in-house with all of the organization’s needs in mind, but it requires additional time to implement and increases the total cost of long-term ownership.
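For a sense of the baseline effort, a minimal local dbt Core setup might look like the sketch below (the adapter and project name are illustrative); scheduling, CI/CD, and a shared development environment are still up to you:

# Install dbt Core plus the adapter for your warehouse (Snowflake shown as an example)
python -m pip install dbt-core dbt-snowflake

dbt init my_project   # scaffold a project and configure a connection profile
dbt debug             # verify the warehouse connection
dbt build             # run and test your models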

dbt alternatives - Code based ETL tools

For organizations seeking a code-based data transformation alternative to dbt, there are a few contenders they may want to consider.

SQLMesh

SQLMesh is an open-source framework that allows for SQL- or Python-based data transformations. Its workflow provides column-level visibility into the impact of changes on downstream models, which helps developers remediate breaking changes. SQLMesh creates virtual data environments that also eliminate the need to calculate data changes more than once. Finally, teams can preview data changes before they are applied to production.

Benefits:

SQLMesh allows developers to create accurate and efficient pipelines with SQL. It integrates well with tools you are using today, such as Snowflake and Airflow. SQLMesh also optimizes cost savings by reusing tables and minimizing computation.

Dataform

Dataform enables data teams to manage all data operations in BigQuery. These operations include creating table definitions, configuring dependencies, adding column descriptions, and configuring data quality assertions. It also provides version control and integrates with GitLab or GitHub.

Benefits:

Dataform is a great option for those using BigQuery because it fosters collaboration among data teams with strong version control and development practices directly integrated into the workflow. Since it keeps you in BigQuery, it also reduces context switching and centralizes data models in the warehouse, improving efficiency.  

AWS Glue

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It can automatically generate ETL code. It is worth noting that AWS Glue also offers GUI elements (like Glue Studio).

Benefits:

AWS Glue provides flexible support for various pipelines such as ETL, ELT, batch and more, all without a vendor lock-in. It also scales on demand, offering a pay-as-you-go billing. Lastly, this all-in-one platform has tools to support all data users from the most technical engineers to the non-technical business users.

dbt alternatives - Graphical ETL tools

While experience has taught us that there is no substitute for a code-based data transformation solution, some organizations may opt for a graphical user interface (GUI) tool. These tools are designed with visual interfaces that allow users to drag and drop components to build data integration and transformation workflows. Ideal for users who may be intimidated by a code editor like VS Code, graphical ETL tools may simplify data processes in the short term.

Matillion

Matillion is a cloud-based data integration platform that allows organizations to build and manage data pipelines and create no-code data transformations at scale. The platform is designed to be user-friendly, offering a graphical interface where users can build data transformation workflows visually.

Benefits:

Matillion simplifies the ETL process with a drag-and-drop interface, making it accessible for users without deep coding knowledge. It also supports major cloud data warehouses like Amazon Redshift, Google BigQuery, and Snowflake, enhancing scalability and performance.

Informatica

Informatica offers extensive data integration capabilities including ETL, hundreds of no-code cloud connectors, data masking, data quality, and data replication. It also uses a metadata-driven approach for data integration. In addition, it was built with performance, reliability, and security in mind to protect your valuable data.

Benefits:

Informatica enhances enterprise scalability and supports complex data management operations across various data types and sources. Informatica offers several low-code and no-code features across its various products, particularly in its cloud services and integration tools. These features are designed to make it easier for users who may not have deep technical expertise to perform complex data management tasks.

Alteryx

Alteryx allows you to automate your analytics at scale. It combines data blending, advanced analytics, and data visualization in one platform. It offers tools for predictive analytics and spatial data analysis.  

Benefits:

Alteryx enables users to perform complex data analytics with AI. It also improves efficiency by allowing data preparation, analysis, and reporting to be done within a single tool. It can be deployed on-prem or in the cloud and is scalable to meet enterprise needs.

Azure Data Factory

Azure Data Factory is a fully managed, serverless data integration service that integrates with various Azure services for data storage and data analytics. It provides a visual interface for data integration workflows which allows you to prepare data, construct ETL and ELT processes, and orchestrate and monitor pipelines code-free.

Benefits:

Azure Data Factory can be beneficial for users utilizing various Azure services because it allows seamless integration with other Microsoft products, which is ideal for businesses already invested in the Microsoft ecosystem. It also supports a pay-as-you-go model.

Talend

Talend offers an end-to-end modern data management platform with real-time or batch data integration as well as a rich suite of tools for data quality, governance, and metadata management. Talend Data Fabric combines data integration, data quality, and data governance into a single, low-code platform.

Benefits:

Talend can enhance data quality and reliability with built-in tools for data cleansing and validation. Talend is a cloud-independent solution and supports cloud, multi-cloud, hybrid, or on-premises environments.

SSIS (SQL Server Integration Services)

SQL Server Integration Services (SSIS) is a component of Microsoft SQL Server, providing a platform for building enterprise-level data integration and data transformation solutions. With this tool you can extract and transform data from a wide variety of sources such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations. It includes graphical tools and wizards for building and debugging packages.

Benefits:

SSIS is ideal for organizations heavily invested in SQL Server environments, offering extensive support and integration capabilities with other Microsoft services and products.

Conclusion

While we believe that code is the best option for expressing the complex logic needed for data pipelines, the dbt alternatives we covered above offer a range of features and benefits that cater to different organizational needs. Tools like Matillion, Informatica, and Alteryx provide graphical interfaces for managing ETL processes, while SQLMesh and Dataform offer code-based approaches to SQL- and Python-based data transformation.

For those specifically looking for a dbt Cloud alternative, Datacoves stands out as a tailored, flexible solution designed to integrate seamlessly with modern data workflows, ensuring efficiency and scalability.

Get our free ebook dbt Cloud vs dbt Core

Get the PDF

Seamless data analytics with managed dbt and managed Airflow

Don’t let platform limitations or maintenance overhead hold you back.

Book a Demo