Mayra Peña | Datacoves

Ultimate dbt Jinja Cheat Sheet

5 mins read

Jinja templating in dbt offers flexibility and expressiveness that can significantly improve SQL code organization and reusability. There is a learning curve, but this cheat sheet is designed to be a quick reference for data practitioners, helping to streamline the development process and reduce common pitfalls.

Whether you're troubleshooting a tricky macro or just brushing up on syntax, bookmark this page. Trust us, it will come in handy and help you unlock the full potential of Jinja in your dbt projects.

If you find this cheat sheet useful, be sure to check out our Ultimate dbt Jinja Functions Cheat Sheet. It covers the specialized Jinja functions created by dbt, designed to enhance versatility and expedite workflows.

dbt Jinja: Basic syntax

This is the foundational syntax of Jinja, from how to comment to the difference between statements and expressions.

Basic Syntax
Comments	`{# Example Comment #}`
Statements	`{% … %} e.g.: for, if`
Expressions	`{{ … }} e.g.: ref(), source(),`

dbt Jinja: Variable assignment

Define and assign variables in different data types such as strings, lists, and dictionaries.

Variable Assignment
Strings	`{% set my_string = "example" %}`
Lists	`{% set my_list = ["apple", "lemon"] %}`
Dictionaries	`{% set my_dict = {"fruit_1": "apple", "fruit_2": "lemon"} %}`

‍

dbt Jinja: White space control

Jinja allows fine-grained control over white spaces in compiled output. Understand how to strategically strip or maintain spaces.

White Space Control
Strip Before	`{%- ... %}`
Strip After	`{% ... -%}`
String Before and After	`{%- ... -%}`

dbt Jinja: Control flow

In dbt, conditional structures guide the flow of transformations. Grasp how to integrate these structures seamlessly.

Control Flow
If/elif/else/endif	`{%- if target.name == 'dev' -%} {{ some code }} {%- elif target.name == 'prod' -%} {{ some other code }} {%- else -%} {{ some other code }} {%- endif -%}`

dbt Jinja: Looping

Discover how to iterate over lists and dictionaries. Understand simple loop syntax or accessing loop properties.

Looping
Loop Syntax	`{%- for item in my_iterable -%} --Do something with item {{ item }} {%- endfor -%}`
loop.last	This boolean is False unless the current iteration is the last iteration. `{% for item in list %} {% if loop.last %} --This is the last item {{ item }} {% endif %} {% endfor %}`
loop.first	A boolean that is True if the current iteration is the first iteration, otherwise False. `{% for item in list %} {% if loop.first %} --first item {{ item }} {% endif %} {% endfor %}`
loop.index	An integer representing the current iteration of the loop (1-indexed). So, the first iteration would have loop.index of 1, the second would be 2, and so on. `{% for item in list %} --This is item number {{ loop.index }} {% endfor %}`
Looping a List	`{% set rating_categories = ["quality_rating", "design_rating", "usability_rating"] %}` `SELECT product_id, {%- for col_name in rating_categories -%} AVG({{ col_name }}) as {{ column_name }}_average {%- if not loop.last -%} , {%- endif -%} {%- endfor -%} FROM product_reviews GROUP BY 1` Compiled code `SELECT product_id, AVG(quality_rating) as quality_rating_average, AVG(design_rating) as design_rating_average, AVG(usability_rating) as usability_rating_average FROM product_reviews GROUP BY 1`
Looping a Dictionary	`{% set delivery_type_dict = {"a": "digital_download", "b": "physical_copy"} %}` `SELECT order_id, {%- for type, column_name in delivery_type_dict.items() -%} COUNT(CASE WHEN delivery_method = '{{ type }}' THEN order_id END) as {{ column_name }}_count {%- if not loop.last -%} , {%- endif -%} {%- endfor -%} FROM order_deliveries GROUP BY 1` `SELECT order_id, COUNT(CASE WHEN delivery_method = 'a' THEN order_id END) as digital_download_count, COUNT(CASE WHEN delivery_method = 'b' THEN order_id END) as physical_copy_count FROM order_deliveries GROUP BY 1`

dbt Jinja: Operators

These logical and comparison operators come in handy, especially when defining tests or setting up configurations in dbt.

Logic Operators
and	`{% if condition1 and condition2 %}`
or	`{% if condition1 or condition2 %}`
not	`{{ not condition1 }}`

‍

Comparison Operators
Equal To	`{% if 1 == 2 %}`
Not Equal To	`{% if 1 != 2 %}`
Greater Than	`{% if 1 > 2 %}`
Less Than	`{% if 1 < 2 %}`
Greater Than or Equal to	`{% if 1 >= 2 %}`
Less Than or Equal To	`{% if 1 <= 2 %}`

dbt Jinja: Variable tests

Within dbt, you may need to validate if a variable is defined or a if a value is odd or even. These Jinja Variable tests allow you to validate with ease.

Jinja Variable Tests
Is Defined	`{% if my_variable is defined %} -- Handle conditions when variable exists {% endif %}`
Is None	`{% if my_variable is none %} -- Handle absence of my_variable {% endif %}`
Is Even	`{% if my_variable is even %} -- Handle when my_variable is even {% endif %}`
Is Odd	`{% if my_variable is odd %} -- Handle when my_variable is odd {% endif %}`
Is a String	`{% if my_variable is string %} -- Handle when my_variable is a string {% endif %}`
Is a Number	`{% if my_variable is number %} -- Handle when my_variable is a number {% endif %}`

dbt Jinja: Creating macros & tests

Macros are the backbone of advanced dbt workflows. Review how to craft these reusable code snippets and also how to enforce data quality with tests.

Creating Macros & Tests
Define a Macro	Write your macros in your project's macros directory. `{% macro ms_to_sec(col_name, precision=3) %} ( {{ col_name }} / 1000 )::numeric(16, {{ precision }}) {% endmacro %}`
Use a Macro from a Model	In a model: `SELECT order_id, {{ ms_to_sec(col_name=time_ms, precision=3) }} as time_sec FROM order_timings;` Compiled code: `SELECT order_id, (time_ms/ 1000 )::numeric(16, 3) AS time_sec FROM order_timings;`
Run a Macro from the Terminal	Define in your macros directory. Ex)macros/create_schema_macro.sql: `{% macro create_schema(schema_name) %} CREATE SCHEMA IF NOT EXISTS {{ schema_name }}; {% endmacro %}` In Termial: dbt run-operation create_schema --args '{"schema_name": "my_new_schema"}'
Define a Generic Test	Generic Tests used to be defined in the macros directory. It is now recommended to write your Generic Tests in the tests/generic directory. `{% test over_10000(model, column_name) %} SELECT {{column_name}} FROM {{ model }} WHERE {{column_name}} > 10000 {% endtest %}`
Use a Generic test	In models/schema.yml add the generic test to the model and column you wish to test. `version: 2 models: - name: my_model columns: - name: column_to_test tests: - over_10000 - not_null`
Define a Singular Test	Write your dbt Singular tests in the tests directory and give it a descriptive name. Ex) test/test_suspicious_refunds.sql `SELECT order_id, SUM(CASE WHEN amount < 0 THEN amount ELSE 0 END) as total_refunded_amount, COUNT(CASE WHEN amount < 0 THEN 1 END) as number_of_refunds FROM {{ ref('my_model') }} GROUP BY 1 HAVING number_of_refunds > 5`

dbt Jinja: Filters (aka Methods)

Fine-tune your dbt data models with these transformation and formatting utilities.

String Manipulation
Lower	`{{ "DATACOVES" \| lower }} => "datacoves"`
Upper	`{{ "datacoves" \| upper }} => "DATACOVES"`
Default	`{{ variable_name \| default("Default Value") }} If value exists => "Sample Value" If value does not exist => "Default Value"`
Trim	`{{ "Datacoves " \| trim }} => "Datacoves"`
Replace	`{{ "Datacoves" \| replace("v", "d") }} => "Datacodes"`
Length	`{{ "Datacoves" \| length }} => 9`
Capitalize	`{{ "datacoves" \| capitalize }} => "Datacoves"`
Title	`{{ "datacoves managed platform" \| capitalize }} => "Datacoves managed platform”`
Repeat a String	`{{ print('-' * 20) }}`
Substring	`{{ "Datacoves"[0:4] }} => "Data"`
Split	`{{ "Data coves".split(' ') }} => ["Data", "coves"]`

‍

Number Manipulation
Int	`{{ "20" \| int }} => 20`
Float	`{{ 20 \| float }} => 20.0`
Rounding to Nearest Whole Number	`{{ 20.1434 \| round }} => 21`
Rounding to a Specified Decimal Place	`{{ 20.1434 \| round(2) }} => 20.14`
Rounding Down (Floor Method)	`{{ 20.5 \| round(method='floor') }} => 20`
Rounding Up (Ceil Method)	`{{ 20.5 \| round(method='ceil') }} => 21`

Please contact us with any errors or suggestions.

Dbt & Airflow

10 dbt Alternatives

5 mins read

dbt (data build tool) is a powerful data transformation tool that allows data analysts and engineers to transform data in their warehouse more effectively. It enables users to write modular SQL queries, which it then runs on top of the data warehouse; this helps to streamline the analytics engineering workflow by leveraging the power of SQL. In addition to this, dbt incorporates principles of software engineering, like modularity, documentation and version control.

dbt Core vs dbt Cloud

Before we jump into the list of dbt alternatives it is important to distinguish dbt Core from dbt Cloud. The primary difference between dbt Core and dbt Cloud lies in their execution environments and additional features. dbt Core is an open-source package that users can run on their local systems or orchestrate with their own scheduling systems. It is great for developers comfortable with command-line tools and custom setup environments. On the other hand, dbt Cloud provides a hosted service with dbt core as its base. It offers a web-based interface that includes automated job scheduling, an integrated IDE, and collaboration features. It offers a simplified platform for those less familiar with command-line operations and those with less complex platform requirements.

You may be searching for alternatives to dbt due to preference for simplified platform management, flexibility to handle your organization’s complexity, or other specific enterprise needs. Rest assured because this article explores ten notable alternatives that cater to a variety of data transformation requirements.

Below is a quick list of the dbt alternatives we will be covering in this article:

We have organized these dbt alternatives into 3 groups: dbt Cloud alternatives, code based dbt alternatives , and GUI based dbt alternatives.

dbt Cloud Alternatives

dbt Cloud is a tool that dbt Labs provides, there are a few things to consider:

Flexibility, may be hindered by the inability to extend the dbt Cloud IDE with Python libraries or VS Code extensions
Handlining enterprise complexity of an end-to-end ELT pipeline will require a full-fledged orchestration tool
Costs can be higher than some of the alternatives below, especially at enterprise scale
Data security and compliance may require VPC deployment which is not available in dbt Cloud
Some features of dbt Cloud are not open source increasing vendor lock-in

‍

Although dbt Cloud can help teams get going quickly with dbt, it is important to have a clear understanding of the long-term vision for your data platform and get a clear understanding of the total cost of ownership. You may be reading this article because you are still interested in implementing dbt but want to know what your options are other than dbt Clould.

Datacoves

Datacoves is tailored specifically as a seamless alternative to dbt Cloud. The platform integrates directly with existing cloud data warehouses, provides a user-friendly interface that simplifies the management and orchestration of data transformation workflows with Airflow, and provides a preconfigured VS Code IDE experience. It also offers robust scheduling and automation with managed Airflow, enabling data transformations with dbt to be executed based on specific business requirements.

Benefits:

Flexibility and Customization: Datacoves allows customization such as enabling VSCode extensions or adding any Python library. This flexibility is needed when adapting to dynamic business environments and evolving data strategies, without vendor lock-in.

Handling Enterprise Complexity: Datacoves is equipped with managed Airflow, providing a full-fledged orchestration tool necessary for managing complex end-to-end ELT pipelines. This ensures robust data transformation workflows tailored to specific business requirements. Additionally, Datacoves does not just support the T (transformations) in the ELT pipeline, the platform spans across the pipeline by helping the user tie all the pieces together. From initial data load to post-transformation operations such as pushing data to marketing automation platforms.

Cost Efficiency: Datacoves optimizes data processing and reduces operational costs associated with data management as well as the need for multiple SaaS contracts. Its pricing model is designed to scale efficiently.

Data Security and Compliance: Datacoves is the only commercial managed dbt data platform that supports VPC deployment in addition to SaaS, offering enhanced data security and compliance options. This ensures that sensitive data is handled within a secure environment, adhering to enterprise security standards. A VPC deployment is advantageous for some enterprises because it helps reduce the red tape while still maintaining optimal security.

Open Source and Reduced Vendor Lock-In: Datacoves bundles a range of open-source tools, minimizing the risk of vendor lock-in associated with proprietary features. This approach ensures that organizations have the flexibility to switch tools without being tied to a single vendor.

Do-it-Yourself dbt Core

It is worth mentioning that that because dbt Core is open source a DIY approach is always an option. However, opting for a DIY solution requires careful consideration of several factors. Key among these is assessing team resources, as successful implementation and ongoing maintenance of dbt Core necessitate a certain level of technical expertise. Additionally, time to production is an important factor; setting up a DIY dbt Core environment and adapting it to your organization’s processes can be time-consuming.

Finally, maintainability is essential- ensuring that the dbt setup continues to meet organizational needs over time requires regular updates and adjustments. While a DIY approach with dbt Core can offer customization and control, it demands significant commitment and resources, which may not be feasible for all organizations.

Benefits:

This is a very flexible approach because it will be made in-house and with all the organization’s needs in mind but requires additional time to implement and increases the total cost of long-term ownership.

dbt alternatives - Code based ETL tools

For organizations seeking a code-based data transformation alternative to dbt, there are two contenders they may want to consider.

SQLMesh

SQLMesh is an open-source framework that allows for SQL or python-based data transformations. Their workflow provides column level visibility to the impact of changes to downstream models. This helps developers remediate breaking changes. SQLMesh creates virtual data environments that also eliminate the need to calculate data changes more than once. Finally, teams can preview data changes before they are applied to production.

Benefits:

SQLMesh allows developers to create accurate and efficient pipelines with SQL. This tool integrates well with tools you are using today such as Snowflake, and Airflow. SQLMesh also optimizes cost savings by reusing tables and minimizing computation.

Dataform

Dataform enables data teams to manage all data operations in BigQuery. These operations include creating table definitions, configuring dependencies, adding column descriptions, and configuring data quality assertions. It also provides version control and integrates with GitLab or GitHub.

Benefits:

Dataform is a great option for those using BigQuery because it fosters collaboration among data teams with strong version control and development practices directly integrated into the workflow. Since it keeps you in BigQuery, it also reduces context switching and centralizes data models in the warehouse, improving efficiency.

AWS Glue

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It automates the provisioning of ETL code. It is worth noting that Amazon Glue offers GUI elements (like Glue Studio).

Benefits:

AWS Glue provides flexible support for various pipelines such as ETL, ELT, batch and more, all without a vendor lock-in. It also scales on demand, offering a pay-as-you-go billing. Lastly, this all-in-one platform has tools to support all data users from the most technical engineers to the non-technical business users.

dbt alternatives - Graphical ETL tools

While experience has taught us that there is no substitute for a code-based data transformation solution. Some organizations may opt for a graphical user interface (GUI) tool. These tools are designed with visual interfaces that allow users to drag and drop components to build data integration and transformation workflows. Ideal for users who may be intimidated by a code editor like VS Code, graphical ETL tools may simplify data processes in the short term.

Matillion

Matillion is a cloud-based data integration platform that allows organizations to build and manage data pipelines and create no-code data transformations at scale. The platform is designed to be user-friendly, offering a graphical interface where users can build data transformation workflows visually.

Benefits:

Matillion simplifies the ETL process with a drag-and-drop interface, making it accessible for users without deep coding knowledge. It also supports major cloud data warehouses like Amazon Redshift, Google BigQuery, and Snowflake, enhancing scalability and performance.

Informatica

Informatica offers extensive data integration capabilities including ETL, hundreds of no code connectors cloud connectors, data masking, data quality, and data replication. It also uses a metadata-driven approach for data integration. In addition, it was built with performance, reliability, and security in mind to protect your valuable data.

Benefits:

Informatica enhances enterprise scalability and supports complex data management operations across various data types and sources. Informatica offers several low-code and no-code features across its various products, particularly in its cloud services and integration tools. These features are designed to make it easier for users who may not have deep technical expertise to perform complex data management tasks.

Alteryx

Alteryx allows you to automate your analytics at scale. It combines data blending, advanced analytics, and data visualization in one platform. It offers tools for predictive analytics and spatial data analysis.

Benefits:

Alteryx enables users to perform complex data analytics with AI. It also improves efficiency by allowing data preparation, analysis, and reporting to be done within a single tool. It can be deployed on-prem or in the cloud and is scalable to meet enterprise needs.

Azure Data Factory

Azure Data Factory is a fully managed, serverless data integration service that integrates with various Azure services for data storage and data analytics. It provides a visual interface for data integration workflows which allows you to prepare data, construct ETL and ELT processes, and orchestrate and monitor pipelines code-free.

Benefits:

Azure Data Factory can be beneficial for users utilizing various Azure services because it allows seamless integration with other Microsoft products, which is ideal for businesses already invested in the Microsoft ecosystem. It also supports a pay-as-you-go model.

Talend

Talend offers an end-to-end modern data management platform with real-time or batch data integration as well as a rich suite of tools for data quality, governance, and metadata management. Talend Data Fabric combines data integration, data quality, and data governance into a single, low-code platform.

Benefits:

Talend can enhance data quality and reliability with built-in tools for data cleansing and validation. Talend is a cloud-independent solution and supports cloud, multi-cloud, hybrid, or on-premises environments.

SSIS (SQL Server Integration Services)

SQL Server Integration Services are a part of Microsoft SQL Server, providing a platform for building enterprise-level data integration and data transformations solutions. With this tool you can extract and transform data from a wide variety of sources such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations. It Includes graphical tools and wizards for building and debugging packages.

Benefits:

SQL Server Integration Services are ideal for organizations heavily invested in SQL Server environments. They offer extensive support and integration capabilities with other Microsoft services and products.

Conclusion

While we believe that code is the best option to express the complex logic needed for data pipelines, the dbt alternatives we covered above offer a range of features and benefits that cater to different organizational needs. Tools like Matillion, Informatica, and Alteryx provide graphical interfaces for managing ETL processes, while SQLMesh, and Dataform offer code-based approaches to SQL and Python based data transformation.

For those specifically looking for a dbt Cloud alternative, Datacoves stands out as a tailored, flexible solution designed to integrate seamlessly with modern data workflows, ensuring efficiency and scalability.

Tooling

10 Open Source SQL Databases

5 mins read

SQL databases are great for organizing, storing, and retrieving structured data essential to modern business operations. These databases use Structured Query Language (SQL), a gold standard tool for managing and manipulating data, which is universally recognized for its reliability and robustness in handling complex queries and vast datasets.

SQL is so instrumental to database management that databases are often categorized based on their use of SQL. This has led to the distinction between SQL databases, which use Structured Query Language for managing data, and NoSQL databases, which do not rely on SQL and are designed for handling unstructured data and different data storage models. If you are looking to compare SQL databases or just want to deepen your understanding of these essential tools, this article is just for you.

What is an Open Source Database?

Open source databases are software systems whose source code is publicly available for anyone to view, modify, and enhance. This article covers strictly open source SQL databases. Why? Because we believe that they bring additional advantages that are reshaping the data management space. Unlike proprietary databases that can be expensive and restrictive, open source databases are developed through collaboration and innovation at their core. This not only eliminates licensing fees but also creates a rich environment of community-driven enhancements. Contributors from around the globe work to refine and evolve these databases, ensuring they are equipped to meet the evolving demands of the data landscape.

Why Use Open Source Databases?

Cost-effectiveness: Most open source databases are free to use, which can significantly reduce the total cost of ownership.

Flexibility and Customization: Users can modify the database software to meet their specific needs, a benefit not always available with proprietary software.

Community Support: Robust communities contribute to the development and security of these databases, often releasing updates and security patches faster than traditional software vendors.

OLTP vs OLAP

When selecting a database, it is important to determine your primary use case. Are you frequently creating, updating, or deleting data? Or do you need to analyze large volumes of archived data that doesn't change often? The answer should guide the type of database system you choose to implement.

In this article we will be touching on OLTP and OLAP open source SQL databases. These databases are structured in different ways depending on the action they wish to prioritize analytics, transactions, or a hybrid of the two.

What is OLTP?

OLTP or Online Transaction Processing databases are designed to manage and handle high volumes of small transactions such as inserting, updating, and/or deleting small amounts of data in a database. OLTP databases can handle real-time transactional tasks due to their emphasis on speed and reliability. The design of OLTP databases is highly normalized to reduce redundancy and optimizes update/insert/delete performance. OLTP databases can be used for analytics but this is not recommended since better databases suited for analytics exist.

Characteristics of OLTP:

Handles large numbers of transactions by many users.
Operations are typically simple (e.g., updating a record or retrieving specific record details).
Focus on quick query processing and maintaining data integrity in multi-access environments.
Data is highly normalized.

When to use OLTP?

Use OLTP if you are developing applications that require fast, reliable, and secure transaction processing. Common use cases include but are not limited to:

E-commerce: Order placement, payment processing, customer profile management, and shopping cart updates.

Banking: Account transactions, loan processing, ATM operations, and fraud detection.

Customer Relationship Management (CRM): Tracking customer interactions, updating sales pipelines, managing customer support tickets, and monitoring marketing campaigns.

What is OLAP?

OLAP or Online Analytical Processing databases are designed to perform complex analyses and queries on large volumes of data. They are optimized for read-heavy scenarios where queries are often complicated and involve aggregations such as sums and averages across many datasets. OLAP databases are typically denormalized, which improves query performance but come with the added expense of storage space and slower update speeds.

Characteristics of OLAP:

Designed for analysis and reporting functions.
Queries are complex and involve large volumes of data.
Focus on maximizing query speed across large datasets.
Data may be denormalized to expedite query processing.

When to use OLAP?

Use OLAP if you need to perform complex analysis on large datasets to gather insights and support decision making. Common use cases include but are not limited to:

Retail Sales Data Analysis: A retail chain consolidates nationwide sales data to analyze trends, product performance, and customer preferences.

Corporate Performance Monitoring: A multinational uses dashboards to track financial, human resources, and operational metrics for strategic decision-making.

Financial Analysis and Risk Management: A bank leverages an OLAP system for financial forecasting and risk analysis using complex data-driven calculations.

In practice, many businesses will use both types of systems: OLTP systems to handle day-to-day transactions and OLAP systems to analyze data accumulated from these transactions for business intelligence and reporting purposes.

Now that we are well versed in OLTP vs OLAP, let's dive into our open source databases!

Open Source OLTP Databases

PostgreSQL

A row-oriented database, often considered the world’s most advanced open source database. PostgreSQL offers extensive features designed to handle a range of workloads from single machines to data warehouses or web services with many concurrent users.

Best Uses: Enterprise applications, complex queries, handling large volumes of data.

SQLite

SQLite is a popular choice for embedded database applications, being a self-contained, high-reliability, and full-featured SQL database engine. This database is a File-based database which means that they store data in a file (or set of files) on disk, rather than requiring a server-based backend. This approach has several key characteristics and advantages such as being lightweight, portable, easy to use, and self-contained.

Best Uses: Mobile applications, small to medium-sized websites, and desktop applications.

MariaDB

A columnar database and offshoot of MySQL. MariaDB was created by the original developers of MySQL after concerns over its acquisition by Oracle. It is widely respected for its performance and robustness.

Best Uses: Web-based applications, cloud environments, or as a replacement for MySQL.

Firebird

Firebird is a flexible relational database offering many ANSI SQL standard features that run on Linux, Windows, and a variety of Unix platforms. This database can handle a hybrid approach of OLTP and OLAP due to its multi-generational architecture and because readers do not block writers when accessing the same data.

Best Uses: Small to medium enterprise applications, particularly where complex, customizable database systems are required.

Open Source OLAP Databases

ClickHouse

Known for its speed, ClickHouse is an open-source column-oriented, File-based database management system that is great at real-time query processing over large datasets. As mentioned earlier in the article, File-based databases bring many benefits They make use of data compression, disk storage of data, parallel processing on multiple cores, distributed processing on multiple servers and more.

Best Uses: Real-time analytics and managing large volumes of data.

DuckDB

Similar to SQLite, DuckDB is an embedded file-based database, however, DuckDB is a column-oriented database that is designed to execute analytical SQL queries fast and efficiently. This database has no dependencies making it a simple, efficient, and portable database. Since it is file-based this means DuckDB runs embedded within the host process, which allows for high-speed data transfer during analytics.

Best Uses: Analytical applications that require fast, in-process SQL querying capabilities.

StarRocks

StarRocks is a performance-oriented, columnar distributed data warehouse designed to handle real-time analytics. StarRocks also supports hybrid row-column storage. It is known for its blazing-fast massively parallel processing (MPP) abilities. Data can be ingested at a high speed and updated and deleted in real time making it perfect for real-time analytics on fresh data.

Best Uses: Real-time analytical processing on large-scale datasets.

Doris

Doris is an MPP-based, column-oriented data warehouse, aimed at providing high performance and real-time analytical processing. Doris can support highly concurrent point query scenarios and high-throughput complex analytic scenarios. Its high speed and ease of use despite working with large amounts of data make it a great option.

Best Uses: Real-time OLAP applications and scenarios demanding fast data processing and complex aggregation.

Trino

Even though Trino is not a database, but rather a query engine that allows you to query your databases, we felt it is a powerful addition to this open source list. Originally developed by Facebook and known as PrestoSQL, Trino is designed to query large data warehouses and big data systems rapidly. Since it is great for working with terabytes or petabytes of data it is an alternative to tools such as Hive or Pig. However, it can also operate on traditional relational databases and other data sources such as Cassandra. One major benefit is that Trino allows you to perform queries across different databases and data sources. This is known as query federation.

Best Uses: Distributed SQL querying for big data solutions.

Citus

While this is not a separate open source database, we felt it was a good addition to the list because Citus is an extension to PostgreSQL that that transforms your Postgres database into a distributed database. This enables it to scale horizontally.

Best Uses: Scalable PostgreSQL applications, especially those needing to handle multi-tenant applications and real-time analytics over large datasets.

Conclusion

Open source SQL databases provide a variety of options for organizations and developers seeking flexible, cost-effective solutions for data management. Whether your needs are for handling large data sets, real-time analytics, or robust enterprise applications, there is likely an open source database out there for you.

Tooling

5 Open Source Data Quality Tools – 2024 Edition

5 mins read

Data is in the spotlight as companies everywhere realize data's true potential. With big initiatives like GenAI and sophisticated data ecosystems, ensuring data quality is not just a necessity but a mandatory investment for businesses and analysts worldwide. Some people are learning the hard way that you need stable data foundations to get the results these initiatives promise.

While there are many great tools out there, the spotlight on open source tools has never been brighter. Open source software offers transparency, adaptability, and community-driven enhancements that are crucial in the rapidly evolving data landscape. This article covers 5 open source data quality tools and is current as of April 2024, so if that is something that interests you, stick around.

What is data quality?

First things first, what is data quality? There are many definitions of data quality, but data is considered high quality if it is fit for its intended uses in operations, decision-making, and planning. In other words, data quality refers to the data's accuracy, completeness, reliability, relevance, and how up-to-date it is. In the context of data-driven decision-making, high-quality data is crucial as it directly impacts the accuracy of insights and the effectiveness of decisions. Our data foundation.

High-quality data should be:

Accurate: Data that is free from errors and discrepancies.

Complete: Data that covers the necessary breadth and depth needed by the business.

Reliable: Data that has no missing elements and is consistently represented and sourced.

Relevant: Data that is applicable to the context and purposes for the business.

Current: Data that is up-to-date and timely for its purpose.

Understanding what is needed for data quality is the first step toward recognizing the importance of these tools and practices that maintain or enhance this quality.

What should you track for good data quality?

Now we know what constitutes high quality data but what do we need to monitor to ensure that our data is high quality? The good news is these metrics tend to be universal. For maintaining high data quality, several metrics and elements should be monitored regularly:

Accuracy: Ensure that your data correctly represents reality or the source from which it came.

Completeness: Check for missing values or data segments that could lead to incorrect analysis or conclusions.

Consistency: Data across different systems or platforms should match and be consistent.

Timeliness: Data should be updated and available in a timeframe that aligns with its intended use.

Validity: Data should adhere to the relevant rules, such as data formats and value ranges.

Uniqueness: No duplicates should be present unless necessary, ensuring each entry is unique.

Integrity: There should be a relationship between datasets and records that maintains data accuracy and consistency.

By tracking these metrics, organizations can set up the essential data foundation and significantly improve the trustworthiness and utility of their data. This will lead to better outcomes and insights that can support great data initiatives of the future.

Since it is essential to track these metrics, companies are on the search for the best tool to help them improve their data quality. Here is a list of open source tools that can be leveraged to improve data quality.

Why open source?

Before we jump into the tool list you may have noticed that a quick google search for this topic will give me many different lists. How is our list different? Well, we are focusing on open source tools. There are many great tools out there both paid and “free” and we put quotes around free because there is no such thing as free; there are always hidden costs (hours worked) for setup and maintenance. However, we wanted to make this open source tool list because regardless of the hidden costs we believe in the following benefits of open source tools:

Transparency: Open source tools offer complete transparency in their operations and algorithms. Users can inspect, modify, and improve the code, which enhances trust and reliability.

Community: Open source projects benefit from the collective intelligence of a global community. This not only accelerates innovation and bug resolution but also provides a large pool of knowledge and support.

Flexibility: With open source, organizations are not locked into proprietary systems, allowing them to tailor tools to their specific needs and integrate them seamlessly into their existing environments.

Cost-effectiveness: While open source doesn't always mean free, it significantly reduces costs associated with licensing fees and vendor lock-in, making cutting-edge tools accessible to everyone.

Quality and Security: Continuous contributions and scrutiny by the community mean that open source tools often meet high standards of quality and security, with issues being identified and addressed rapidly.

Our selection of open source data quality tools is grounded in rigorous open source criteria. We believe that the strength of an open source project lies not just in its ability to solve complex problems but also in its community, transparency, and commitment to ongoing improvement. When compiling this list, we considered factors such as active community engagement, frequency of updates, the quality of the documentation, and ease of contribution. This ensures that the tools recommended not only meet high standards of performance and reliability but also embody the principles that make open source software a valuable asset to the data quality landscape. So without further ado, let's jump into our list.

Open source data quality tool list

dbt Core

Self managed or in Datacoves

Primary Language: SQL / YAML

Purpose & Features: dbt core is an open source tool that allows data analysts and engineers to transform data in their data warehouses by writing dynamic SQL queries, which dbt then converts into tables and views. It also supports version control, testing, and documentation, which helps maintain data integrity and reliability.

For data quality, dbt Core has some out of the box data tests which can be extended through custom made test, or by using libraries such as dbt-expectations and elementary. Testing is easily done by configuring macros in YAML files or by writing custom SQL tests. However, integrating dbt Core into your data stack can be a big task especially when it comes to scheduling. A managed dbt Core platform such as Datacoves could be a great option for saving time and money. While dbt handles only the 'T' in ELT, Datacoves’ managed dbt Core Platform ensures that the entire ELT process is smooth and interconnected, allowing your team to concentrate on deriving insights from the data. There are other dbt alternatives on the market that can also be explored which handle the "T" in the ELT process.

Who it is for: Best for teams using SQL who want to transform data directly in the warehouse and who want to follow software development best practices including unit testing in their data pipelines.

Soda Core

Primary Language: YAML

Purpose & Features: Soda Core is the open source component that allows users to define data quality checks in code and integrate them into workflows.

Who it's for: Teams that need data quality checks integrated into their existing Python workflows or data pipelines.

Great Expectations

Primary Language: Python

Purpose & Features: This tool is a data quality platform that allows you to create data tests, documentation, and profiles automatically. It easily integrates into existing data processing pipelines to ensure data validation against expectations (unit tests). You can collaborate with nontechnical stakeholders by sharing the Data Docs. Data docs are Expectations, Validation Results, and other metadata translated into a human readable format as seen in the image below.

Who it's for: Data teams looking for a Pythonic way to enforce data quality rules and create automated data documentation.

Deequ

Primary Language: Scala (for Apache Spark)

Purpose & Features: Deequ is an open source tool by Amazon with which you can define "unit tests" (columnar or row level) for large-scale data within the Spark ecosystem. It allows for automated checks of data quality metrics such as completeness, uniqueness, and conformity. This enables data teams to find errors early before they are consumed downstream. You can use Deequ to define your assumptions about the data in unit tests to catch any data that does not meet your assumptions. This tool works on tabular data such as CSV files, databases tables, logs and flattened JSON files.

Who it's for: Data engineers and scientists working with big data in Spark (billons of rows), particularly those focused on maintaining data quality at scale.

Data Diff

Primary Language: You don’t manually write data quality tests but as you make changes to your SQL data diff will work its magic.

Purpose & Features: This tool is a little different from the rest because you're not exactly writing tests to catch data quality issues. Instead, this open source Python package by Datafold lets you do development testing by spotting the differences between tables whenever you tweak your code. It's a great way to compare what's happening in your production data against your development changes, helping you see directly how those code changes are playing out in the data.

Who it's for: Data engineers and teams who need to ensure that changes in data processing and ETL logic do not negatively affect data quality.

Choose the right open source data quality tool for your use case

The concept of "the best" for data quality tools is inherently tied to specific use cases. What might be an ideal solution for one organization could be less effective for another, depending on the unique challenges and requirements each face.

Before you dive into a tool, it's crucial to understand your organization's specific data quality challenges. Are you dealing with high volumes of data, requiring scalability? Or are your main issues related to data consistency and accuracy in a smaller, more controlled dataset? Identifying your primary use case will help you navigate through our top 10 tools and select the one that best fits your situation.

Here’s 6 tips to help you make an informed choice for your specific needs:

1. Assess Your Data Quality Needs:

Identify the primary issues you're facing with your data. Are you struggling with incomplete data, inconsistencies, outdated information, or data that's not in the right format? Understanding your main challenges will guide you toward a tool that specializes in addressing those specific problems. Once you understand your data quality challenges and objectives, match these with the strengths of the tools listed above

2. Consider Your Technical Environment:

Evaluate the technical stack you are currently using. Some data quality tools are better suited for certain environments or integrate more seamlessly with specific databases, data lakes, or processing frameworks. Choose a tool that aligns with your existing infrastructure to reduce integration headaches.

3. Evaluate Community and Support:

The strength of an open-source tool lies in its community. Look for a tool with an active community, which is evident through regular updates, vibrant forums, and extensive documentation. A strong community can provide invaluable support, from troubleshooting to best practices.

4. Check for Flexibility and Scalability:

Your data needs will evolve, so it’s important to choose a tool that is flexible and can scale with your business. Assess the tool’s ability to handle different data volumes, types, and sources. A good open-source tool should not only solve your current data quality issues but also adapt to future challenges.

5. Review Security and Compliance Features:

Data security and compliance are imperative. Be sure the tool complies with data protection regulations and offers security features to protect your data. This is especially important if you're dealing with sensitive or personal information.

6. Test Drive the Tool:

Finally, don’t hesitate to get your hands dirty. Most open-source tools are free to use, so take advantage of this by testing the tool with your data. This will give you a clear idea of the tool’s usability, effectiveness, and fit with your use case. Be sure to go into this with an open mind to get the most out of the tool.

Conclusion

In the era of generative AI and other lofty initiatives high-quality data is not just an option but a necessity, and embracing these open-source data quality tools can significantly enhance the reliability and accuracy of your data. Remember, the "best" tool is one that aligns closely with your specific use case offering the features and flexibility your team needs to effectively tackle your data quality challenges; it very well could be a combination of these tools. Whether you are in the world of SQL, Python, or any other programming language, there is a tool tailored to your needs. Consider factors such as ease of integration into your current data ecosystem, the learning curve for your team, and the level of community support available.

Innovation

Healthcare's Digital Transformation: Data Strategy Issues

5 mins read

There is no doubt about the transformative potential of big data and analytics. This is particularly true for the Life Science sector as implementing technologies and ideologies can revolutionize drug development, tailor medicines to individual needs, dramatically improve patient care and more. The data supports this, with the longest running report of Fortune 1000 CIOs by Wavestone, showing 87.9% of companies believe “investments in Data & Analytics are a Top Organizational Priority.”

Such great promise and high buy in should mean easy cultural adoption, right? Well not exactly. The 2024 report shows there has been a notable improvement; the percentage of top executives facing significant challenges in culture, people, and process/organization decreased from a staggering 90% in 2020 to 77.6% in 2024. While this reduction signifies a positive trend towards addressing these issues, the fact that more than three-quarters of leaders still encounter these problems underscores the widespread nature of these challenges. The persistently high percentage highlights the need for continued and focused efforts to overcome these barriers.

In an effort to help further lower the 77.6%, this article aims to cover the benefits of data in the Life Science sector, highlight common culture issues, and provide some solutions to the problem. If your organization is among those facing these struggles, know that you are not alone.

Benefit of data in Life Science sector

The life science sector stands on the brink of a digital transformation, powered by the strategic use of data. It's clear that its impact is far-reaching, transforming every facet from research and development to patient care and beyond. Below are some examples of what this data-driven culture can improve.

Accelerating Drug Development: One of the most significant benefits of data lies in its ability to streamline and accelerate the drug development process. By harnessing big data and advanced analytics, life science organizations can identify potential drug candidates faster than ever before. This not only reduces the time and cost associated with bringing new treatments to market but also allows for more targeted and effective therapies.
Enhancing Personalized Medicine: Personalized medicine, tailored to the individual characteristics of each patient, is becoming a reality thanks to data. By analyzing vast datasets, including genetic information, researchers can understand how different people are likely to respond to treatments. This leads to more effective and personalized care plans, improving patient outcomes and reducing the likelihood of adverse reactions.
Improving Clinical Trials: Data analytics revolutionize the way clinical trials are conducted. By leveraging data, scientists can design better trials, select suitable candidates more accurately, and monitor results in real time. This not only increases the efficiency and efficacy of trials but also enhances patient safety by identifying potential issues earlier in the process.
Driving Operational Efficiency: Beyond research and patient care, data also plays a crucial role in enhancing the operational efficiency of life science organizations. From streamlining supply chains to optimizing resource allocation, data-driven insights help companies operate more efficiently, reduce costs, and respond more quickly to market changes.
Fostering Innovation: At its core, the utilization of data fosters an environment of innovation within the life science sector. By providing a wealth of information and insights, data encourages researchers to explore new hypotheses, uncover hidden patterns, and push the boundaries of what's possible in medical science.
Informing Policy and Decision-Making: Lastly, data aids in the development of more informed policies and decisions. By analyzing trends, outcomes, and costs, policymakers and healthcare providers can make evidence-based decisions that lead to better health outcomes and more efficient use of resources.

Digital transformation in healthcare - Key challenges

As we identified earlier in this article, culture, people, and process/organization challenges are the biggest obstacles for companies to achieve digital transformation and become data-driven. Identifying specific challenges is key to developing a solution.

Resistance to Change: Cultural norms and resistance to change within organizations can impede the adoption of new data strategies, especially when there is a lack of buy-in from key stakeholders. This can be especially present in the Life Science sector which tends to lean risk averse.
Overemphasis on Buzzwords and End Products: Initiatives often start with buzzwords like 'Gen AI', focusing primarily on the final product without addressing underlying flawed infrastructures and processes. This inclination towards technology is because technology is tangible, you can talk to vendors and see their websites, however, consistent findings from the annual report reveal that technical challenges constitute only 23.4% of the issues; the bulk stems from ingrained cultural problems within organizations. Meaning, adopting new technologies without addressing these foundational processes is likely to lead to suboptimal outcomes.
Lack of Alignment: When different departments or teams pursue the same objective without proper coordination, it results in alignment issues. These misalignments act as barriers to the seamless execution of unified data strategies, causing data management inefficiencies. Often, this situation stems from a lack of centralized guidance and strategic direction. It is important for organizational leaders to bridge these gaps through a top-down approach, ensuring all components work in harmony towards the collective aim.
Cultural Complexity: Large enterprises especially face this issue. Cultural complexities, such as diverse social and cultural preferences, values, and disparities in knowledge, can pose challenges in implementing data strategies, particularly in institutions spanning developed and developing countries. This is why taking the time to align fundamentally is especially important for the enterprise.
Misplaced Prioritization of Data: Often, data is viewed as a technology concern for IT to handle, while other teams focus on deliverables like patient service or clinical trials. This perspective is problematic as data is crucial for analyzing and ensuring the success of these initiatives. Typically, data management is considered too late in the process, when the necessary data may not have been effectively captured for analysis.

Solutions to digital transformation challenges

Questioning the status quo

Pharma companies tend to be risk averse due to the nature of the data they are responsible for. This leads to limitations and constraints to innovative solutions. These constraints are often accepted without understanding the rationale behind these constraints or challenging them. New technology offers new ways to manage data that were not available in the past. If a company remains the most risk averse and the most conservative this will stifle innovation. The key takeaway is that it is not only about technology but about the new process potential that this new tech provides.

Focus on fundamentals

This goes back to the foundation. Technology changes, initiatives change, but data truths do not. These truths involve thinking about end-to-end processes, data quality, documentation, good guidelines conventions, data governance, reducing points of failure ect. It is easy to get swept up by the latest trend and want to jump in, but you cannot build a house on quicksand; new trends like Gen AI still require these things underneath.

Fundamental alignment

This starts with aligning the team. Alignment is one of the 3 core pillars to a data-driven culture. People will start out thinking they are on the same page because they are using the same terminology. But we all have different ideas influenced by our experiences. So, it is important to gather all stakeholders and go through a true alignment process which includes figuring out the current state, pain points, and aligning on the solution.

Fundamental alignment includes adopting a top-down approach. The efforts one team is making are sure to affect the efforts of another. Leadership must align and connect the dots between all moving parts and understand that things are not living in isolation. This will solve headaches downstream and ensure a smoother process.

Quick wins

Alignment alone won’t be enough to combat culture challenges. True alignment can lead to overly ambitious projects that may prove difficult to execute. This is the classic trap of trying to do too much at once. It is important to start with manageable, small-scale projects that can provide immediate benefits. These quick wins are vital for positively influencing the organizational culture. Additionally, they help maintain momentum: if a larger initiative is slow to yield results, these smaller successes ensure continuous progress. By allowing for ongoing adjustments and reprioritizations, these projects help prevent initiatives from being abandoned.

Conclusion

While the potential of big data and analytics in transforming the Life Science sector is undeniable, and investment in data is on the rise, the journey towards a data-driven culture remains one of the biggest challenges for data integration. The path forward requires a balanced approach that combines technology with fundamental changes in culture and processes. By embracing these changes, the life sciences sector can overcome existing barriers and fully harness the power of data to advance medical science and improve patient outcomes. At Datacoves we are passionate about helping companies achieve a data-driven culture. See how Datacoves helped Johnson&Johnson innovate their tech stack.

Prefer a Video?

This article was inspired by RAN BioLinks’ podcast episode “Why Life Science Organizations Fail to Implement Effective Data Strategies”. The detailed conversation with Noel Gomez, a seasoned expert in data management within the life science industry, explores the critical challenges and innovative solutions for effective data strategies in healthcare.

Dbt & Airflow

dbt Deployment Options

5 mins read

dbt is wildly popular and has become a fundamental part of many data stacks. While it’s easy to spin up a project and get things running on a local machine, taking the next step and deploying dbt to production isn’t quite as simple.

In this article we will discuss options for deploying dbt to production, comparing some high, medium, and low effort options so that you can find which works best for your business and team. You might be deploying dbt using one of these patterns already; if you are, hopefully this guide will help highlight some improvements you can make to your existing deployment process.

We're going to assume you know how to run dbt on your own computer (aka your local dbt setup). We’re also going to assume that you either want to or need to run dbt in a “production” environment – a place where other tools and systems make use of the data models that dbt creates in your warehouse.

Enhancing understanding of dbt deployment

The deployment process for dbt jobs extends beyond basic scheduling and involves a multifaceted approach. This includes establishing various dbt environments with distinct roles and requirements, ensuring the reliability and scalability of these environments, integrating dbt with other tools in the (EL)T stack, and implementing effective scheduling strategies for dbt tasks. By focusing on these aspects, a comprehensive and robust dbt deployment strategy can be developed. This strategy will not only address current data processing needs but also adapt to future challenges and changes in your data landscape, ensuring long-term success and reliability.

dbt environments

In deploying dbt you have the creation and management of certain dbt environments. The development environment is the initial testing ground for creating and refining dbt models. It allows for experimentation without impacting production data. Following this, the testing environment, including stages like UAT and regression testing, rigorously evaluates the models for accuracy and performance. Finally, the production environment is where these models are executed on actual data, demanding high stability and performance.

Data reliability and scalability

Reliability and scalability of data models are also important. Ensuring that the data models produce accurate and consistent results is essential for maintaining trust in your data. As your data grows, the dbt deployment should be capable of scaling, handling increased volumes, and maintaining performance.

End-to-End data pipeline integration

Integration with other data tools and systems is another key aspect. A seamless integration of dbt with EL tools, data visualization platforms, and data warehouses ensures efficient data flow and processing, making dbt a harmonious component of your broader data stack.

dbt job scheduling

Effective dbt scheduling goes beyond mere time-based scheduling. It involves context-aware execution, such as triggering jobs based on data availability or other external events. Managing dependencies within your data models is critical to ensure that transformations occur in the correct sequence. Additionally, adapting to varying data loads is necessary to scale resources effectively and maintain the efficiency of dbt job executions.

The main flavors for dbt deployment are:

Cron Job
Cloud Runner Service (like dbt Cloud)
Integrated Platform (like Datacoves)
Fully Custom (like Airflow, Astronomer, etc)

They each have their place, and the trade-offs between setup costs and long-term maintainability is important to consider when you’re choosing one versus another.

We can compare these dbt deployment options across the following criteria:

Ease of Use / Implementation
Required Technical Ability
Configurability
Customization
Best for End-to End Deployment

Cron job

Cron jobs are scripts that run at a set schedule. They can be defined in any language. For instance, we can use a simple bash script to run dbt. It’s just like running the CLI commands, but instead of you running them by hand, a computer process would do it for you.

Here’s a simple cron script:

In order to run on schedule, you’ll need to add this file to your system’s crontab.

As you can tell, this is a very basic dbt run script; we are doing the bare minimum to run the project. There is no consideration for tagged models, test, alerting, or more advanced checks.

Even though Cron jobs are the most basic way to deploy dbt there is still a learning curve. It requires some technical skills to set up this deployment. Additionally, because of its simplicity, it is pretty limited. If you are thinking of using crons for multi-step deployments, you might want to look elsewhere.

While it's relatively easy to set up a cron job to run on your laptop this defeats the purpose of using a cron altogether. Crons will only run when the daemon is running, so unless you plan on never turning off your laptop, you’ll want to set up the cron on an EC2 instance (or another server). Now you have infrastructure to support and added complexity to keep in mind when making changes. Running a cron on an EC2 instance is certainly doable, but likely not the best use of resources. Just because it can be done does not mean it should be done. At this point, you’re better off using a different deployment method.

The biggest downside, however, is that your cron script must handle any edge cases or errors gracefully. If it doesn’t, you might wind up with silent failures – a data engineer’s worst enemy.

Who should use Cron for dbt deployment?

Cron jobs might serve you well if you have some running servers you can use, have a strong handle on the types of problems your dbt runs and cron executions might run into, and you can get away with a simple deployment with limited dbt steps. It is also a solid choice if you are running a small side-project where missed deployments are probably not a big deal.

Use crons for anything more complex, and you might be setting yourself up for future headaches.

Ease of Use / Implementation – You need to know what you’re doing

Required Technical Ability – Medium/ High

Configurability – High, but with the added complexity of managing more complex code

Customization – High, but with a lot of overhead. Best to keep things very simple

Best for End-to-End Deployment - Low.

Cloud Service Runners (like dbt Cloud)

Cloud Service Runners like dbt Cloud are probably the most obvious way to deploy your dbt project without writing code for those deployments, but they are not perfect.

dbt Cloud is a product from dbt Labs, the creators of dbt. The platform has some out-of-the-box integrations, such as Github Actions and Webhooks, but anything more will have to be managed by your team. While there is an IDE (Integrated Developer Experience) that allows the user to write new dbt models, you are adding a layer of complexity by orchestrating your deployments in another tool. If you are only orchestrating dbt runs, dbt Cloud is a reasonable choice – it's designed for just that.

However, when you want to orchestrate more than just your dbt runs – for instance, kickoff multiple Extract / Load (EL) processes or trigger other jobs after dbt completes – you will need to look elsewhere.

dbt Cloud will host your project documentation and provide access to its APIs. But that is the lion’s share of the offering. Unless you spring for the Enterprise Tier, you will not be able to do custom deployments or trigger dbt runs based on incoming data with ease.

Deploying your dbt project with dbt Cloud is straightforward, though. And that is its best feature. All deployment commands use native dbt command line syntax, and you can create various "Jobs" through their UI to run specific models at different cadences.

Who should use dbt Cloud for dbt deployment?

If you are a data team with data pipelines that are not too complex and you are looking to handle dbt deployments without the need for standing up infrastructure or stringing together advanced deployment logic, then dbt Cloud will work for you. If you are interested in more complex triggers to kickoff your dbt runs - for instance, triggering a run immediately after your data is loaded – there are other options which natively support patterns like that. The most important factor is the complexity of the pieces you need to coordinate, not necessarily the size of your team or organization.

Overall, it is a great choice if you’re okay working within its limitations and support a simple workflow. As soon as you reach any scale, however, the cost may be too high.

Ease of Use / Implementation – Very easy

Required Technical Ability – Low

Configurability – Low / Medium

Customization – Low

Best for End-to-End Deployment - Low

Integrated platform (like Datacoves)

The Modern Data Stack is a composite of tools. Unfortunately, many of those tools are disconnected because they specialize in handling one of the steps in the ELT process. Only after working with them do you realize that there are implicit dependencies between these tools. Tools like Datacoves bridge the gaps between the tools in the Modern Data Stack and enable some more flexible dbt deployment patterns. Additionally, they cover the End-to-End solution, from Extraction to Visualization, meaning it can handle steps before and after Transformation.

Coordinating dbt runs with EL processes

If you are loading your data into Snowflake with Fivetran or Airbyte, your dbt runs need to be coordinated with those EL processes. Often, this is done by manually setting the ETL schedule and then defining your dbt run schedule to coincide with your ETL completion. It is not a hard dependency, though. If you’re processing a spike in data or running a historical re-sync, your ETL pipeline might take significantly longer than usual. Your normal dbt run won’t play nicely with this extended ETL process, and you’ll wind up using Snowflake credits for nothing.

This is a common issue for companies moving from early stage / MVP data warehouses into more advanced patterns. There are ways to connect your EL processes and dbt deployments with code, but Datacoves makes it much easier. Datacoves will trigger the right dbt job immediately after the load is complete. No need to engineer a solution yourself. The value of the Modern Data Stack is being able to mix and match tools that are fit for purpose.

Seamless integration and orchestration

Meeting established data freshness and quality SLAs is challenging enough, but with Datacoves, you’re able to skip building custom solutions for these problems. Every piece of your data stack is integrated and working together. If you are orchestrating with Airflow, then you’re likely running a Docker container which may or may not have added dependencies. That’s one common challenge teams managing their own instances of Airflow will meet, but with Datacoves, container / image management and synchronization between EL and dbt executions are all handled on the platform. The setup and maintenance of the scalable Kubernetes infrastructure necessary to run Airflow is handled entirely by the Datacoves platform, which gives you flexibility but with a lower learning curve. And, it goes without saying that this works across multiple environments like development, UAT, and production.

Streamlining the Datacoves experience

With the End-to-End Pipeline in mind, one of the convenient features is that Datacoves provides a singular place to access all the tools within your normal analytics workflow - extraction + load, transformation, orchestration, and security controls are in a single place. The implicit dependencies are now codified; it is clear how a change to your dbt model will flow through to the various pieces downstream.

Who should use Datacoves for dbt deployment?

Datacoves is for teams who want to introduce a mature analytics workflow without the weight of adopting and integrating a new suite of tools on their own. This might mean you are a small team at a young company, or an established analytics team at an enterprise looking to simplify and reduce platform complexity and costs.

There are some prerequisites, though. To make use of Datacoves, you do need to write some code, but you’ll likely already be used to writing configuration files and dbt models that Datacoves expects. You won't be starting from scratch because best practices, accelerators, and expertise are already provided.

Ease of Use / Implementation – You can utilize YAML to generate DAGS for a simpler approach, but you also have the option to use Python DAGS for added flexibility and complexity in your pipelines.

Required Technical Ability – Medium

Configurability – High

Customization – High. Datacoves is modular, allowing you to embed the tools you already use

Best for End-to-End Deployment - High. Datacoves takes into account all of the factors of dbt Deployment

Fully custom (like Airflow, Dagster, etc)

What do you use to deploy your dbt project when you have a large, complex set of models and dependencies? An orchestrator like Airflow is a popular choice, with many companies opting to use managed deployments through services such as Astronomer.

For many companies – especially in the enterprise – this is familiar territory. Adoption of these orchestrators is widespread. The tools are stable, but they are not without some downsides.

These orchestrators require a lot of setup and maintenance. If you’re not using a managed service, you’ll need to deploy the orchestrator yourself, and handle the upkeep of the infrastructure running your orchestrator, not to mention manage the code your orchestrator is executing. It’s no small feat, and a large part of the reason that many large engineering groups have dedicated data engineering and infrastructure teams.

Running your dbt deployment through Airflow or any other orchestrator is the most flexible option you can find, though. The increase in flexibility means more overhead in terms of setting up the systems you need to run and maintain this architecture. You might need to get DevOps involved, you’ll need to move your dbt project into a Docker image, you’ll want an airtight CI/CD process, and ultimately have well defined SLAs. This typically requires Docker images, container management, and some DevOps work. There can be a steep learning curve, especially if you’re unfamiliar with what’s needed to take an Airflow instance to a stable production release.

There are 3 ways to run Airflow, specifically – deploying on your own, using a managed service, or using an integrated platform like Datacoves. When using a managed service or an integrated platform like Datacoves, you need to consider a few factors:

Airflow use cases

Ownership and contributing teams

Integrations with rest of your stack

Airflow use cases

Airflow is a multi-purpose tool. It’s not just for dbt deployments. Many organizations run complex data engineering pipelines with Airflow, and by design, it is flexible. If your use of Airflow extends well beyond dbt deployments or ELT jobs oriented around your data warehouse, you may be better suited for a dedicated managed service.

Ownership and contributing teams

Similarly, if your organization has numerous teams dedicated to designing, building and maintaining your data infrastructure, you may want to use a dedicated Airflow solution. However, not every organization is able to stand up platform engineering teams or DevOps squads dedicated to the data infrastructure. Regardless of the size of your team, you will need to make sure that your data infrastructure needs do not outmatch your team’s ability to support and maintain that infrastructure.

Integrations with the rest of your stack

Every part of the Modern Data Stack relies on other tools performing their jobs; data pipelines, transformations, data models, BI tools - they are all connected. Using Airflow for your dbt deployment adds another link in the dependency chain. Coordinating dbt deployments via Airflow can always be done through writing additional code, but this is an additional overhead you will need to design, implement, and maintain. With this approach, you begin to require strong software engineering and design principles. Your data models are only as useful as your data is fresh; meeting your required SLAs will require significant cross-tool integration and customization.

Who should use a fully custom setup for dbt deployment?

If you are a small team looking to deploy dbt, there are likely better options. If you are a growing team, there are certainly simpler options with less infrastructure overhead. For Data teams with complex data workflows that combine multiple tools and transformation technologies such as Python, Scala, and dbt, however, Airflow and other orchestrators can be a good choice.

Ease of Use / Implementation – Can be quite challenging starting from scratch

Required Technical Ability – High

Configurability – High

Customization – High, but build time and maintenance costs can be prohibitive ‍

Best for End-to-End Deployment - High, but requires a lot of resources to set up and maintain

dbt Deployment Options Overview

	Cron Job	Cloud Runner	Integrated Platform	Fully Custom Deployment
Ease of Use/ Implementation	Need to know what you’re doing and only good for simple use cases.	Very easy to start up	Simple using YAML with ability to build more complex pipelines with Python.	Challenging for most
Required Technical Ability	Medium/ High	Low	Medium	High
Configurability	High	Low / Medium	High	High
Customization	High	Low	High	High
End-to-End Deployment	Low	Low	High	High

Final thoughts

The way you should deploy your dbt project depends on a handful of factors – how much time you’re willing to invest up front, your level of technical expertise, as well as how much configuration and customization you need.

Small teams might have high technical acumen but not enough capacity to manage a deployment on their own. Enterprise teams might have enough resources but maintain disparate, interdependent projects for analytics. Thankfully, there are several options to move your project beyond your local and into a production environment with ease. And while specific tools like Airflow have their own pros and cons, it’s becoming increasingly important to evaluate your data stack vendor solution holistically. Ultimately, there are many ways to deploy dbt to production, and the decision comes down to spending time building a robust deployment pipeline or spending more time focusing on analytics.

Dbt & Airflow

Ultimate dbt Jinja Functions Cheat Sheet

5 mins read

Jinja is the game changing feature of dbt Core that allows us to create dynamic SQL code. In addition to the standard Jinja library, dbt Core includes additional functions and variables to make working with dbt even more powerful out of the box.

See our original post, The Ultimate dbt Jinja Cheat Sheet, to get started with Jinja fundamentals like syntax, variable assignment, looping and more. Then dive into the information below which covers Jinja additions added by dbt Core.

This cheatsheet references the extra functions, macros, filters, variables and context methods that are specific to dbt Core.

Enjoy!

dbt Core: Functions

These pre-defined dbt Jinja functions are essential to the dbt workflow by allowing you to reference models and sources, write documentation, print, write to the log and more.

Functions
ref(project_or_package, model_name, version=latest_version)	This function allows you to reference models within or accross dbt projects. The version and project_or_package parameters are optional. `select * from {{ ref('model_name') }} select * from {{ ref('model_name', version=1) }} select * from {{ ref('project_or_package', 'model_name') }}`
source(source_name, table_name )	This function allows you to create a dependency between the current model and a source. `select * from {{ source('source_name', 'table_name') }}`
log(msg, info=False)	This function writes a message to the log and/or stdout. To write to the both the log file and stdout set the info arg to True. If set to False, the function will only write to the log file. `{{ log('Logging something', info=True) }}`
print("string to print")	Use this function when you wish to print messages to the log and stdout. This is the same as log() with info=True. `{{ print("Print something: " ~ value1 ~ ", " ~ value2) }}`
doc('doc_block_name')	The doc function allows you reference {% docs blocks %} from markdown (.md) files in the description field of property files(.yml). It is important to note that the doc() function is how you reference the contents of a {% docs block %}. `{% docs table_sales_description %} # docs - go - here {% enddocs %} #schema.yml version: 2 models: - name: sales description: '{{ doc("table_sales_description") }}'` The {% docs block %} can be used to override preset overviews.
return(data)	Use this function to return data to the caller. This function will preserve the data type: dict, list, int etc. `{% macro get_list() %} {{ return(['column_1', 'column_2', 'column_2']) }} {% endmacro %}` `select -- getdata() returns a list! {% for item in get_list() %} {{ item }} {% if not loop.last %},{% endif %} {% endfor %} from {{ ref('some_model') }}`
env_var('env_var_name', 'default_value')	env_var is used to incorporate environment variables anywhere you can use Jinja code. `"{{ env_var('DB_PORT') \| as_number }}"` Use the optional second argument to set a default and avoid compilation errors. `#dbt_project.yml models: my_project: +materialized: "{{ env_var('DBT_MATERIALIZATION', 'table') }}"`
var('variable_name')	Use when configuring packages in multiple environments or defining values across multiple models in a package. For variables that rarely change, define them in the dbt_project.yml file. Variables that change frequently can be defined through the CLI. `#dbt_project.yml vars: start_date: '2023-10-01'` `select * from orders where start_date >'{{ var("start_date") }}'`

dbt Core: Macros

These macros are provided in dbt Core to improve the dbt workflow.

Macros
run_query()	This macro allows you to run SQL queries and store their results. It returns a table object with the query result. `{% set results = run_query('select order from orders_table') %}`
debug()	Call this macro within a macro to open up a python debugger. The DBT_MACRO_DEBUGGING environment variable must be set. The debug macro is only available in dbt Core/Datacoves. Only use {{ debug }} in a development environment. `{% macro my_macro() %} {% set result = my_macro() %} {{ debug() }} {% endmacro %}`

dbt Core: Filters

These dbt Jinja filters are used to modify data types.

Filters
as_bool	This Jinja filter will coerce the output into a boolean value. If the conversion can’t happen then it will return an error. `{{ <my_value> \| as_bool }}`
as_number	This Jinja filter will coerce the output into an numeric value. If the conversion can’t happen then it will return an error. `{{ <my_value> \| as_number }}`
as_native	This Jinja filter will return the native python type (set, list, tuple, dict, etc). Best for iterables, use as_bool or as_number for boolean or numeric values. `{{ <my_value> \| as_native }}`

dbt Core: Project context variables

These dbt core "variables" such as config, target, source, and others contain values that provide contextual information about your dbt project and configuration settings. They are typically used for accessing and customizing the behavior of your dbt models based on project-specific and environment-specific information.

Project Context Variables
adapters	dbt uses the adapter to communicate with the database. Setup correctly using the Database specific adapter i.e Snowflake, RedShift. The adapter has methods that will be translated into SQL statements specific to your database.
builtins	The builtins variable is a dictionary containing keys for dbt context methods ref, source, and config. `{% macro ref(modelname) %} {% set db_name = builtins.ref(modelname).database \| lower %}} {% if db_name.startswith('staging') or db_name.endswith('staging') %} {{ return(builtins.ref(modelname).include(database=false)) }} {% else %} {{ return(builtins.ref(modelname)) }} {% endif %} {% endmacro %}`
config	The config variable allows you to get configuration set for dbt models and set constraints that assure certain configurations must exist for a given model: config.get('<config_key>', default='value'): Fetches a configuration named <config_key> for example "materialization". - If no default is provided or the configuration item is not defined, it this will return None `{%- set unique_key = config.get('unique_key', default='id') -%}` config.require("<config_key>"): Strictly requires a key named <config_key> is defined in the configuration. - Throws error if not set. `{%- set unique_key = config.require('unique_key') -%}` config.get and config.require are commonly seen in the context of custom materializations, however, they can be used in other macros, if those macros are used within a model, seed, or snapshot context and they have the relevant configurations set.
dbt_version	The dbt_version variable is helpful for debugging by returning the installed dbt version. This is the dbt version running not what you define in your project. e.g. If you make a project with dbt 1.3 and run it on another machine with dbt 1.6, this will say 1.6
execute	The execute variable is set to True when dbt runs SQL against the databse such as when executing dbt run. When running dbt commands such as dbt compile where dbt parses your project but no SQL is run against the database execute is set to False. This variable is helpful when your Jinja is relying on a result from the database. Wrap the jinja in an if statement. `{% set payment_method_query %} SELECT DISTINCT payment_method FROM {{ ref('raw_payments') }} ORDER BY 1 {% endset %} {% set payment_methods = run_query(payment_method_query) %} {% if execute %} {# Extract the payment methods from the query results #} {% set payment_methods_list = payment_methods.columns[0].values() %} {% else %} {% set payment_methods_list = [] %} {% endif %}`
flags	This variable holds the value of the flags passed in the command line such as FULL_REFRESH, STORE_FAILURES, and WHICH(compile, run, build, run-operation, test, and show). It allows you to set logic based on run modes and run hooks based on current commands/type. `{% if flags.STORE_FAILURES %} --your logic {% else %} --other logic {% endif %}`
graph	The graph variable is dictionary that hold the information about the nodes(Models, Sources, Tests, Snapshots) in your dbt project.
model	This is the graph object of the current model. This object allows you to access the contents of a model, model structure and JSON schema, config settings, and the path to the model.
modules	This variable contains Python Modules for working with data including:datetime, pytz, re, and itertools. `modules.<desired_module>.<desired_function>` `{% set now = modules.datetime.datetime.now() %}`
project_name	This variable returns the name for the root-level project that is being run.
target	This variable contains information about your dbt profile target, such as your warehouse connection information. Use the dot notation to access more such as: target.name or target.schema or target.database, ect

dbt Core: Run context variables

These special variables provide information about the current context in which your dbt code is running, such as the model, schema, or project name.

Run Context Variablels
database_schemas	Only available in the context for on-run-end. This variable allows you to reference the databases and schemas. Useful if using multiple different databases
invocation_id	This function outputs a UUID every time you run or compile your dbt project. It is useful for auditing. You may access it in the query-comment, info dictionary in events and logs, and in the metadata dictionary in dbt artifacts.
results	Only available in the context for on-run-end. This variable contains a list of Results Objects. Allows access to the information populated in run results JSON artifact.
run_started_at	This variable outputs a timestamp for the start time of a dbt run and defaults to UTC. It Is a python datetime object. Use standard strftime formatting. `select '{{ run_started_at.strftime("%Y-%m-%d") }}' as run_start_utc`
schemas	Only available in the context for on-run-end. This variable allows you to reference a list of schemas for models built during a dbt run. Useful for granting privileges.
selected_resources	This variable allows you to access a list of selected nodes from the current dbt command. The items in the list depend on the parameters of —select, —exclude,—selector.
this	{{ this }} is the database representation of the current model. Use the dot notation to access more properties such as: {{ this.database }} and {{ this.schema }}.
thread_id	The thread_id is a unique identifier assigned to the Python threads that are actively executing tasks, such as running code nodes in dbt. It typically takes the form of names like "Thread-1," "Thread-2," and so on, distinguishing different threads.

dbt Core: Context methods

These methods allow you to retrieve information about models, columns, or other elements of your project.

Context Methods
set(value, default)	Allows you to use the python method set(), which converts an iterable into a unique set of values. This is NOT the same as the jinja expression set which is used to assign a value to a variable. This will return none if the value entered is not an iterable. In the example below both the python set() method and the jinja set expression are used to remove a duplicate element in the list. `{% set my_list = ['a', 'b', 'b', 'c'] %} {% set my_set = set(my_list) %} {% do print(my_set) %} {# evaluates to {'a','b', 'c'} #}`
set_strict(value)	Same as the set method above however it will raise a TypeError if the entered value is not a valid iterable. `{% set my_set = set_strict(my_list) %}`
exceptions	Is used to raise errors and warnings in a dbt run: raise_compiler_error will raise an error, print out the message. Model will FAIL. `exceptions.raise_compiler_error("Custom Error message")` warn will raise a compiler warning and print out the set method. Model will still PASS. `exceptions.warn("Custom Warning message")`
fromjson(string, default)	Is used to deserialize a JSON string into a python dict or list.
fromyaml(string, default)	Is used to deserialize a YAML string into a python dict or list.
tojson(value, default)	Serializes a Python dict or list to a JSON string.
toyaml(value, default )	Serializes a Python dict or list to a YAML string.
local_md5	This variable locally creates an MD5 hash of the given string. `{%- set hashed_value = local_md5(“String to hash”) -%} select '{{ hashed_value }}' as my_hashed_value`
zip(*args, default)	This method is allows you to combine any number of iterables. `{% set my_list_a = [1, 2] %} {% set my_list_b = ['Data', 'Coves'] %} {% set my_zip = zip(my_list_a, my_list_b) \| list %} {% do print(my_zip) %} {# Result [(1, 'Data'), (2, 'Coves')] #}`
zip_strict(value)	Same as the zip method but will raise a TypeError if one of the given values are not a valid iterable. `{% set my_list_a = 123 %} {% set my_list_b = 'Datacoves' %} {% set my_zip = zip_strict(my_list_a, my_list_b) \| list %} {# This will fail #}`

Please contact us with any errors or suggestions.

Dbt & Airflow

dbt Core vs dbt Cloud – Key Differences as of 2025

5 mins read

If you've taken an interest in dbt (data build tool) and are on the fence about whether to opt for dbt Cloud or dbt Core, you're in the right place. Perhaps you're already using one of the dbt platforms and are considering a change. Regardless of your current position, understanding the differences of these options is crucial for making an informed decision. In this article, we'll delve deep into the key distinctions between dbt Cloud and dbt Core.

dbt, dbt Core, dbt Cloud: What do they mean?

For those new to the dbt community, navigating the terminology can be a tad confusing. "dbt," "dbt Core," and "dbt Cloud" may sound similar but each represents a different facet of the dbt ecosystem. Let's break it down.

via GIPHY

What is dbt?

dbt is the generic name for the open-source tool and when people say dbt the features are mainly those in dbt Core. dbt allows users to write, document, and execute SQL-based transformations, making it easier to produce reliable and up-to-date analytics. By facilitating practices like version control, testing, and documentation, dbt enhances the analytics engineering workflow, turning raw data into actionable insights.

Once you decide dbt is right for your organization, the next step is to determine how you'll access dbt. The two most prevalent methods are dbt Core and dbt Cloud. While dbt Cloud offers an enhanced experience with additional features, its abstraction can sometimes limit the desired flexibility and control over the workflow especially when it comes to using dbt with the complexities of an enterprise.

Throughout this article we'll observe that by using dbt Core and incorporating other tools, you can achieve many of the same functionalities as dbt Cloud while maintaining flexibility and control. While this approach offers enhanced flexibility, it consequently introduces increased complexity, maintenance, and an added workload. When adopting a dbt platform it is important to understand the tradeoffs to truly know what will work best for your data team.

What is dbt Core?

dbt Core is an open-source data transformation tool that enables data analysts and engineers to transform and model data to derive business insights. dbt Core is the foundational, open-source version of dbt that provides users with the utmost flexibility. The term "flexible" implies that users have complete autonomy over its implementation, integration, and configuration within their projects.

Even though dbt Core is free, to meet or exceed the functionality of dbt Cloud, it will need to be paired with additional tooling as we will discuss below.These open source solutions may be leverage at no cost, but this increases the platform maintenance overhead and may impact the total cost of ownership and the platform's time to market. Alternatively, managed dbt Core platforms exist, like Datacoves, which simplify this process.

Using and installing dbt Core is done manually. Depending on which data warehouse you are using, you select the appropriate dit adapter such as dit-snowflake, dbt-databricks, dt-redshift, etc. You can see all available dbt adapters on our dbt libraries. If you are using Snowflake you can check out our detailed Snowflake with dbt getting started guide.

Given that you have installed the pre-requisites, installing dbt is just a matter of installing dbt-snowflake.

dbt-snowflake CLI installation ‍ — dbt-snowflake CLI installation

What is dbt Cloud?

dbt Cloud is a hosted dbt platform to develop and deploy dbt projects. dbt Cloud leverages all the power of dbt Core with some extra features such as a proprietary Web-based UI, a dbt job scheduler, APIs, integration with Continuous Integrations platforms like Github Actions, and a proprietary Semantic layer. dbt Cloud's features are all intended to facilitate the dbt workflow.

dbt cloud pricing has three tiers: Enterprise, Team and Developer. Developer is a free tier meant for a single developer with a hard limit of 3000 model runs per month. The Team Plan pricing starts at $100 per developer for teams up to 8 with 15,000 successful models built per month; any additional models will cost $0.01.

dbt Core vs dbt Cloud: Development - IDE

When it comes to the Integrated Development Environment (IDE), both dbt Cloud and dbt Core present distinct advantages and challenges. Whether you prioritize flexibility, ease of setup, or a blend of both, your choice will influence how your team develops, tests, and schedules your data transformations. Let's explore how each option handles the IDE aspect and the impact on developers and analytic engineers.

dbt Core

In the instance of IDEs, using dbt Core requires setting up a dev environment on each member's device or a virtual space like AWS workspace. This involves installing a popular dbt IDE such as VS Code, dbt Core, connecting to a data warehouse, and handling dependencies like Python versions.

Enterprise dbt setups typically include additional dependencies to enhance productivity. Some notable VS Code extensions for this include dbt Power User, SQLFluff, and the official dbt Snowflake VS Code extension.

VsCode IDE with dbt Core — VS Code IDE with dbt Core

dbt Cloud

When companies are ramping up with dbt, one of the pain points is setting up and managing dbt IDE environments. Analytic Engineers coming to dbt may not be familiar with concepts like version control with git or using the command line. The dbt Cloud IDE simplifies developer onboarding by providing a web-based SQL IDE to team members so they can easily write, test, and refine data transformation code without having to install anything on their computers. Complexities like starting a git branch are tucked behind a big colorful button so users know that is the first step in their new feature journey.

However, Developers who are accustomed to more versatile local IDEs, such as VS Code, may find the dbt Cloud experience limiting as they cannot leverage extensions such as those from the VS Code Marketplace nor can they extend dbt Core using the vast array of Python libraries.

It is possible to get the best of both worlds - the flexibility of dbt Core in VS Code and the quick setup that dbt Cloud Offers - with a Managed dbt Core Platform like Datacoves. In a best-in-class developer setup, new users are onboarded in minutes with simple configuration screens that remove the need to edit text files such as profiles.yml and remove the complexity of creating and managing SSH keys. Version upgrades of dbt or any dependent library should be transparent to users. Spinning up a pristine environment should be a matter of clicks.

dbt Core vs dbt Cloud: Scheduling jobs

Scheduling in a dbt project is crucial for ensuring timely and consistent data updates. It's the backbone of reliable and up-to-date analytics in a dbt-driven environment.

dbt Core

While an orchestrator does not come out of the box with dbt Core, when setting up a deployment environment companies can leverage any orchestration tool, such as Airflow, Dagster, or Prefect. They can connect steps prior to or after the dbt transformations and they can trigger any tool that exists within or outside the corporate network.

dbt Cloud

dbt Cloud makes deploying a dbt Core project simple. It allows you to define custom environment variables and the specific dbt commands (seed, run, test) that you want to run during production runs. The dbt Cloud scheduler can be configured to trigger at specific intervals using an intuitive UI.

dbt Cloud is primarily focused on running dbt projects. Therefore, if a data pipeline has more dependencies, an external orchestration tool may be required. Fortunately, if you do use an external orchestrator, dbt Cloud offers an API to trigger dbt Cloud jobs from your orchestrator.

dbt Core vs dbt Cloud: DataOps - Continuous Integration

DataOps emphasizes automating the integration of code changes, ensuring that data transformations are consistently robust and reliable. Both platforms approach CI/CD differently. How seamless is the integration? How does each platform handle tool compatibility?

dbt Core

When using dbt Core for your enterprise data platform, you will need to not only define and configure the automation scripts, but you will also need to ensure that all the components, such as a git server, CI server, CI runners, etc. are all working harmoniously together.

Since dbt Core can be run within the corporate firewall, it can be integrated with any CI tool and internal components such as Jira, Bitbucket, and Jenkins. To do this well, all the project dependencies must be packed into reusable Docker containers. Notifications will also need to be defined across the various components and all of this will take time and money.

dbt Cloud

dbt Cloud has built in CI capabilities which reduce the need for third party tools. dbt Cloud can also be paired with Continuous Integration (CI) tools like GitHub Actions to validate data transformations before they are added to a production environment. Aspects such as code reviews and approvals will occur in the CI/CD tool of choice such as GitHub and dbt Cloud can report job execution status back to GitHub Actions. This allows teams to know when it is safe to merge changes to their code. One item to note is that each successful model run in your CI run will count against the monthly model runs as outlined in the dbt Cloud pricing.

Companies that have tools like Bitbucket, Jira, and Jenkins within their corporate firewall may find it challenging to integrate with dbt Cloud.

dbt Core vs dbt Cloud: Semantic layer

A semantic layer helps businesses define important metrics like sales, customer churn, and customer activations with the flexibility to aggregate at run time. These metrics can be referenced by downstream tools as if they had been previously computed. End-users benefit from the flexibility to aggregate metrics at diverse grains without the company incurring the cost of pre-computing every permutation. These on-the-fly pivots ensure consistent and accurate insights across the organization.

dbt Core

dbt Core does not come with a built-in semantic layer, but there are open source and proprietary alternatives that allow you to achieve the same functionality. These include cube.dev, and Lightdash.

dbt Cloud

dbt Cloud has been rolling out a proprietary semantic layer which is currently in public preview. This feature is only available to dbt pricing plans Team and Enterprise. When using the dbt Cloud semantic layer your BI tool connects to a dbt Cloud proxy server which sits between the BI tool and your Data Warehouse.

dbt’s semantic layer offers a system where metrics are standardized as dbt metadata, visualized in your DAG, and integrated seamlessly with features like the Metadata API and the dbt proxy server.

dbt Core vs. dbt Cloud: Project Exploration and Lineage Tracking

Understanding your dbt project's structure and data flow is essential for effective data management and collaboration. While dbt Cloud offers dbt Explorer, a tool that visually maps model dependencies and metadata, it is exclusive to dbt Cloud users.

dbt Docs (dbt docs generate) is a built-in feature in dbt Core that generates a static documentation site, providing lineage graphs and detailed metadata for models, columns, and tests. However, for larger projects, dbt Docs can struggle with high memory usage and slow load times, making it less practical for extensive datasets. Also, dbt Docs lacks column-level lineage, which is crucial for impact analysis and debugging.

But no worries—dbt Core users can achieve similar, and even better, functionalities through alternative methods. The answer: a data catalog like DataHub. A Data Catalog can significantly enhance not just your dbt exploration, but your entire data project discovery experience!

DataHub Offers:

Searchable Metadata with Comprehensive Integration – DataHub can ingest metadata from various sources, including Apache Airflow and dbt, consolidating everything into a single repository. This creates a unified view of your data ecosystem, surpassing the capabilities of dbt Explorer.
Column-Level Lineage – See how data moves and transforms at the column level, enabling more precise impact analysis and debugging.
Collaborative Features – Facilitate teamwork with tools designed for sharing and documenting data knowledge across your organization such as a data glossary.

There is an obvious caveat. Implementing and maintaining an open-source data catalog like DataHub introduces additional complexity. Organizations need to allocate resources to manage, update, and scale the platform effectively. Fortunately, a managed solution like Datacoves simplifies this by providing an integrated offering that includes DataHub, streamlining deployment and reducing maintenance overhead.

dbt Core vs dbt Cloud: API

APIs play a crucial role in streamlining dbt operations and enhancing extensibility.

dbt Core

With dbt Core, users often rely on external solutions to integrate specific API functionalities.

Administrative API Alternative: There is currently no feature-to-feature alternative for the dbt Cloud administrative API. However, the Airflow API can be leveraged to enqueue runs for jobs which is a primary feature of the dbt Cloud Administrative API.

Discovery API Alternative: This API was formerly known as the dbt Cloud Metadata API. Tools such as Datahub can provide similar functionality. Datahub can consume dbt Core artifacts such as the manifest.json and expose an API for dbt metadata consumption.

Semantic Layer API Alternative: When it comes to establishing and managing the semantic layer, Cube.dev provides a mature, robust, and comprehensive alternative to the dbt Cloud Semantic layer. Cube also has an API tailored for this purpose.

dbt Cloud

dbt Cloud offers three APIs. These APIs are available to Team and Enterprise customers.

Administrative API: The dbt Cloud Administrative API is designed primarily for tasks like initiating runs from a job, monitoring the progress of these runs, and retrieving artifacts once the jobs have been executed. dbt Cloud is working on additional functionality for this API, such as operational functions within dbt Cloud.

Discovery API: Whenever you run a project in dbt Cloud, it saves details about that project, such as information about your data models, sources, and how they connect. The Discovery API lets you access and understand this saved information. Use cases include: performance, quality, discovery, governance and development.

Semantic Layer API: The dbt Semantic Layer API provides a way for users to interact with their data using a JDBC driver. By using this API, you can easily query metrics values from your data and get insights.

Conclusion

Examining the differences between dbt Core and dbt Cloud reveals that both can lead organizations to similar results. Much of what dbt Cloud offers can be replicated with dbt Core when combined with appropriate additional tools. While this might introduce some complexities, the increased control and flexibility might justify the trade-offs for certain organizations. Thus, when deciding between the two, it's a matter of prioritizing simplicity versus adaptability for the team. This article only covers dbt core vs dbt cloud but you can read more about dbt alternatives in our blog..

As a managed dbt Core solution, the Datacoves platform simplifies the dbt Core experience and retains its inherent flexibility. It effectively bridges the gap, capturing many benefits of dbt Cloud while mitigating the challenges tied to a pure dbt Core setup. See if Datacoves dbt pricing is right for your organization or visit our product page.