Jinja templating in dbt offers flexibility and expressiveness that can significantly improve SQL code organization and reusability. There is a learning curve, but this cheat sheet is designed to be a quick reference for data practitioners, helping to streamline the development process and reduce common pitfalls.
Whether you're troubleshooting a tricky macro or just brushing up on syntax, bookmark this page. Trust us, it will come in handy and help you unlock the full potential of Jinja in your dbt projects.
If you find this cheat sheet useful, be sure to check out our Ultimate dbt Jinja Functions Cheat Sheet. It covers the specialized Jinja functions created by dbt, designed to enhance versatility and expedite workflows.
This is the foundational syntax of Jinja, from how to comment to the difference between statements and expressions.
Basic Syntax | Example
---|---
Comments | `{# This comment will not appear in the compiled SQL #}`
Statements | `{% set status = 'active' %}` (statements control logic and render no output)
Expressions | `{{ status }}` (expressions print a value into the compiled output)
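For instance, a sketch of all three constructs in one dbt model might look like this (the model and column names are made up for illustration):

```sql
{# This comment is dropped from the compiled SQL #}
{% set status_column = 'order_status' %}  -- a statement: it assigns a variable and renders nothing

select
    order_id,
    {{ status_column }} as status  -- an expression: it prints the variable's value
from raw_orders
```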
Define and assign variables in different data types such as strings, lists, and dictionaries.
Variable Assignment | Example
---|---
Strings | `{% set my_string = 'hello world' %}`
Lists | `{% set my_list = ['red', 'green', 'blue'] %}`
Dictionaries | `{% set my_dict = {'color': 'red', 'size': 'large'} %}`
Jinja allows fine-grained control over whitespace in the compiled output. Understand how to strategically strip or keep spaces.
White Space Control | Example
---|---
Strip Before | `{%- set my_var = 1 %}` (the leading hyphen removes whitespace before the tag)
Strip After | `{% set my_var = 1 -%}` (the trailing hyphen removes whitespace after the tag)
Strip Before and After | `{%- set my_var = 1 -%}`
In dbt, conditional structures guide the flow of transformations. Grasp how to integrate these structures seamlessly.
Control Flow | Example
---|---
If/elif/else/endif | `{% if amount > 100 %} 'large' {% elif amount > 10 %} 'medium' {% else %} 'small' {% endif %}`
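A sketch of how such a conditional might switch logic by environment inside a model; the target names and the Snowflake-style dateadd function are assumptions:

```sql
select
    order_id,
    amount
from {{ ref('stg_orders') }}
{% if target.name == 'dev' %}
-- keep development runs small
where order_date >= dateadd('day', -3, current_date)
{% elif target.name == 'ci' %}
where false
{% else %}
-- no filter in production
{% endif %}
```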
Discover how to iterate over lists and dictionaries. Understand simple loop syntax and how to access loop properties.
Looping | Description / Example
---|---
Loop Syntax | `{% for item in my_list %} {{ item }} {% endfor %}`
loop.last | A boolean that is False unless the current iteration is the last iteration.
loop.first | A boolean that is True if the current iteration is the first iteration, otherwise False.
loop.index | An integer representing the current iteration of the loop (1-indexed), so the first iteration has a loop.index of 1, the second 2, and so on.
Looping a List | `{% for col in ['a', 'b', 'c'] %}{{ col }}{% if not loop.last %}, {% endif %}{% endfor %}` compiles to `a, b, c`
Looping a Dictionary | `{% for key, value in my_dict.items() %} {{ key }} = {{ value }} {% endfor %}`
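As a fuller sketch (the stg_payments model and its columns are hypothetical), a loop can generate one aggregated column per list element, using loop.last to avoid a trailing comma:

```sql
{% set payment_methods = ['bank_transfer', 'credit_card', 'gift_card'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount end) as {{ method }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by 1
```

The compiled SQL contains one sum(case when ...) column per payment method, with no comma after the last one.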
These logical and comparison operators come in handy, especially when defining tests or setting up configurations in dbt.
Logic Operators | Example
---|---
and | `{% if is_active and is_paid %}`
or | `{% if is_active or is_trial %}`
not | `{% if not is_deleted %}`

Comparison Operators | Example
---|---
Equal To | `{% if status == 'active' %}`
Not Equal To | `{% if status != 'active' %}`
Greater Than | `{% if amount > 100 %}`
Less Than | `{% if amount < 100 %}`
Greater Than or Equal To | `{% if amount >= 100 %}`
Less Than or Equal To | `{% if amount <= 100 %}`
Within dbt, you may need to check whether a variable is defined or whether a value is odd or even. These Jinja variable tests let you validate with ease.
Jinja Variable Tests | Example
---|---
Is Defined | `{% if my_var is defined %}`
Is None | `{% if my_var is none %}`
Is Even | `{% if my_number is even %}`
Is Odd | `{% if my_number is odd %}`
Is a String | `{% if my_var is string %}`
Is a Number | `{% if my_var is number %}`
Macros are the backbone of advanced dbt workflows. Review how to craft these reusable code snippets and also how to enforce data quality with tests.
Creating Macros & Tests | Description / Example
---|---
Define a Macro | Write your macros in your project's macros directory using `{% macro my_macro(arg) %} ... {% endmacro %}`.
Use a Macro from a Model | In a model, call the macro with an expression: `{{ my_macro('some_column') }}`. The compiled code contains the rendered SQL in place of the macro call.
Run a Macro from the Terminal | Define the macro in your macros directory, e.g. macros/create_schema_macro.sql, then run it in the terminal: `dbt run-operation create_schema --args '{"schema_name": "my_new_schema"}'`
Define a Generic Test | Generic tests used to be defined in the macros directory; it is now recommended to write them in the tests/generic directory using `{% test my_test(model, column_name) %} ... {% endtest %}`.
Use a Generic Test | In models/schema.yml, add the generic test to the model and column you wish to test.
Define a Singular Test | Write your dbt singular tests in the tests directory and give the file a descriptive name, e.g. tests/test_suspicious_refunds.sql.
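As a sketch, a macro and a generic test might look like the following; the names, the cents-to-dollars logic, and the Postgres/Snowflake-style cast are illustrative, not part of the original cheat sheet:

```sql
-- macros/cents_to_dollars.sql (hypothetical macro)
{% macro cents_to_dollars(column_name, decimal_places=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ decimal_places }})
{% endmacro %}

-- tests/generic/not_empty_string.sql (hypothetical generic test)
{% test not_empty_string(model, column_name) %}
    select * from {{ model }} where trim({{ column_name }}) = ''
{% endtest %}
```

In a model, `{{ cents_to_dollars('amount_cents') }}` would compile to `(amount_cents / 100)::numeric(16, 2)`, and the generic test would be listed under a column in models/schema.yml as `not_empty_string`.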
Fine-tune your dbt data models with these transformation and formatting utilities.
String Manipulation | Example
---|---
Lower | `{{ "DBT" | lower }}` returns `dbt`
Upper | `{{ "dbt" | upper }}` returns `DBT`
Default | `{{ my_var | default("fallback value") }}` returns the fallback if my_var is undefined
Trim | `{{ " dbt " | trim }}` returns `dbt`
Replace | `{{ "hello world" | replace("world", "dbt") }}` returns `hello dbt`
Length | `{{ "dbt" | length }}` returns `3`
Capitalize | `{{ "data build tool" | capitalize }}` returns `Data build tool`
Title | `{{ "data build tool" | title }}` returns `Data Build Tool`
Repeat a String | `{{ "ab" * 3 }}` returns `ababab`
Substring | `{{ "databricks"[0:4] }}` returns `data`
Split | `{{ "a,b,c".split(",") }}` returns `['a', 'b', 'c']`
Number Manipulation | Example
---|---
Int | `{{ "42" | int }}` returns `42`
Float | `{{ 42 | float }}` returns `42.0`
Rounding to Nearest Whole Number | `{{ 42.55 | round }}` returns `43.0`
Rounding to a Specified Decimal Place | `{{ 42.554 | round(2) }}` returns `42.55`
Rounding Down (Floor Method) | `{{ 42.55 | round(method='floor') }}` returns `42.0`
Rounding Up (Ceil Method) | `{{ 42.45 | round(method='ceil') }}` returns `43.0`
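A small sketch combining a few of these filters inside a model; the variables and the stg_orders reference are made up:

```sql
{% set raw_country = ' france ' %}
{% set tax_rate = '0.21' %}

select
    '{{ raw_country | trim | upper }}' as country,     -- compiles to 'FRANCE'
    {{ tax_rate | float | round(1) }} as rounded_tax   -- compiles to 0.2
from {{ ref('stg_orders') }}
```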
Please contact us with any errors or suggestions.
dbt (data build tool) is a powerful data transformation tool that allows data analysts and engineers to transform data in their warehouse more effectively. It enables users to write modular SQL queries, which it then runs on top of the data warehouse; this helps to streamline the analytics engineering workflow by leveraging the power of SQL. In addition to this, dbt incorporates principles of software engineering, like modularity, documentation and version control.
Before we jump into the list of dbt alternatives, it is important to distinguish dbt Core from dbt Cloud. The primary difference between dbt Core and dbt Cloud lies in their execution environments and additional features. dbt Core is an open-source package that users can run on their local systems or orchestrate with their own scheduling systems. It is great for developers comfortable with command-line tools and custom setup environments. dbt Cloud, on the other hand, provides a hosted service with dbt Core at its base. It offers a web-based interface that includes automated job scheduling, an integrated IDE, and collaboration features, providing a simplified platform for those less familiar with command-line operations and those with less complex platform requirements.
You may be searching for alternatives to dbt due to a preference for simplified platform management, the flexibility to handle your organization's complexity, or other specific enterprise needs. Rest assured, this article explores ten notable alternatives that cater to a variety of data transformation requirements.
We have organized these dbt alternatives into three groups: dbt Cloud alternatives, code-based dbt alternatives, and GUI-based dbt alternatives.
dbt Cloud is a tool that dbt Labs provides, and there are a few things to consider before adopting it.
Although dbt Cloud can help teams get going quickly with dbt, it is important to have a clear understanding of the long-term vision for your data platform and of the total cost of ownership. You may be reading this article because you are still interested in implementing dbt but want to know what your options are other than dbt Cloud.
Datacoves is tailored specifically as a seamless alternative to dbt Cloud. The platform integrates directly with existing cloud data warehouses, provides a user-friendly interface that simplifies the management and orchestration of data transformation workflows with Airflow, and provides a preconfigured VS Code IDE experience. It also offers robust scheduling and automation with managed Airflow, enabling data transformations with dbt to be executed based on specific business requirements.
Flexibility and Customization: Datacoves allows customization such as enabling VSCode extensions or adding any Python library. This flexibility is needed when adapting to dynamic business environments and evolving data strategies, without vendor lock-in.
Handling Enterprise Complexity: Datacoves is equipped with managed Airflow, providing a full-fledged orchestration tool necessary for managing complex end-to-end ELT pipelines. This ensures robust data transformation workflows tailored to specific business requirements. Additionally, Datacoves does not just support the T (transformation) in the ELT pipeline; the platform spans the entire pipeline, helping the user tie all the pieces together, from the initial data load to post-transformation operations such as pushing data to marketing automation platforms.
Cost Efficiency: Datacoves optimizes data processing and reduces operational costs associated with data management as well as the need for multiple SaaS contracts. Its pricing model is designed to scale efficiently.
Data Security and Compliance: Datacoves is the only commercial managed dbt data platform that supports VPC deployment in addition to SaaS, offering enhanced data security and compliance options. This ensures that sensitive data is handled within a secure environment, adhering to enterprise security standards. A VPC deployment is advantageous for some enterprises because it helps reduce the red tape while still maintaining optimal security.
Open Source and Reduced Vendor Lock-In: Datacoves bundles a range of open-source tools, minimizing the risk of vendor lock-in associated with proprietary features. This approach ensures that organizations have the flexibility to switch tools without being tied to a single vendor.
It is worth mentioning that, because dbt Core is open source, a DIY approach is always an option. However, opting for a DIY solution requires careful consideration of several factors. Key among these is assessing team resources, as successful implementation and ongoing maintenance of dbt Core necessitate a certain level of technical expertise. Additionally, time to production is an important factor; setting up a DIY dbt Core environment and adapting it to your organization's processes can be time-consuming.
Finally, maintainability is essential: ensuring that the dbt setup continues to meet organizational needs over time requires regular updates and adjustments. While a DIY approach with dbt Core can offer customization and control, it demands significant commitment and resources, which may not be feasible for all organizations.
This is a very flexible approach because it will be made in-house and with all the organization’s needs in mind but requires additional time to implement and increases the total cost of long-term ownership.
For organizations seeking a code-based data transformation alternative to dbt, there are two contenders they may want to consider.
SQLMesh is an open-source framework that allows for SQL- or Python-based data transformations. Its workflow provides column-level visibility into the impact of changes on downstream models, which helps developers remediate breaking changes. SQLMesh creates virtual data environments that also eliminate the need to calculate data changes more than once. Finally, teams can preview data changes before they are applied to production.
SQLMesh allows developers to create accurate and efficient pipelines with SQL. This tool integrates well with tools you are using today such as Snowflake, and Airflow. SQLMesh also optimizes cost savings by reusing tables and minimizing computation.
Dataform enables data teams to manage all data operations in BigQuery. These operations include creating table definitions, configuring dependencies, adding column descriptions, and configuring data quality assertions. It also provides version control and integrates with GitLab or GitHub.
Dataform is a great option for those using BigQuery because it fosters collaboration among data teams with strong version control and development practices directly integrated into the workflow. Since it keeps you in BigQuery, it also reduces context switching and centralizes data models in the warehouse, improving efficiency.
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. It can automatically generate ETL code. It is worth noting that AWS Glue also offers GUI elements (like Glue Studio).
AWS Glue provides flexible support for various pipelines such as ETL, ELT, batch, and more, all without vendor lock-in. It also scales on demand, offering pay-as-you-go billing. Lastly, this all-in-one platform has tools to support all data users, from the most technical engineers to non-technical business users.
While experience has taught us that there is no substitute for a code-based data transformation solution, some organizations may opt for a graphical user interface (GUI) tool. These tools are designed with visual interfaces that allow users to drag and drop components to build data integration and transformation workflows. Ideal for users who may be intimidated by a code editor like VS Code, graphical ETL tools may simplify data processes in the short term.
Matillion is a cloud-based data integration platform that allows organizations to build and manage data pipelines and create no-code data transformations at scale. The platform is designed to be user-friendly, offering a graphical interface where users can build data transformation workflows visually.
Matillion simplifies the ETL process with a drag-and-drop interface, making it accessible for users without deep coding knowledge. It also supports major cloud data warehouses like Amazon Redshift, Google BigQuery, and Snowflake, enhancing scalability and performance.
Informatica offers extensive data integration capabilities including ETL, hundreds of no-code cloud connectors, data masking, data quality, and data replication. It also uses a metadata-driven approach to data integration. In addition, it was built with performance, reliability, and security in mind to protect your valuable data.
Informatica enhances enterprise scalability and supports complex data management operations across various data types and sources. Informatica offers several low-code and no-code features across its various products, particularly in its cloud services and integration tools. These features are designed to make it easier for users who may not have deep technical expertise to perform complex data management tasks.
Alteryx allows you to automate your analytics at scale. It combines data blending, advanced analytics, and data visualization in one platform. It offers tools for predictive analytics and spatial data analysis.
Alteryx enables users to perform complex data analytics with AI. It also improves efficiency by allowing data preparation, analysis, and reporting to be done within a single tool. It can be deployed on-prem or in the cloud and is scalable to meet enterprise needs.
Azure Data Factory is a fully managed, serverless data integration service that integrates with various Azure services for data storage and data analytics. It provides a visual interface for data integration workflows which allows you to prepare data, construct ETL and ELT processes, and orchestrate and monitor pipelines code-free.
Azure Data Factory can be beneficial for users utilizing various Azure services because it allows seamless integration with other Microsoft products, which is ideal for businesses already invested in the Microsoft ecosystem. It also supports a pay-as-you-go model.
Talend offers an end-to-end modern data management platform with real-time or batch data integration as well as a rich suite of tools for data quality, governance, and metadata management. Talend Data Fabric combines data integration, data quality, and data governance into a single, low-code platform.
Talend can enhance data quality and reliability with built-in tools for data cleansing and validation. Talend is a cloud-independent solution and supports cloud, multi-cloud, hybrid, or on-premises environments.
SQL Server Integration Services (SSIS) is a part of Microsoft SQL Server, providing a platform for building enterprise-level data integration and data transformation solutions. With this tool you can extract and transform data from a wide variety of sources such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations. It includes graphical tools and wizards for building and debugging packages.
SSIS is ideal for organizations heavily invested in SQL Server environments. It offers extensive support and integration capabilities with other Microsoft services and products.
While we believe that code is the best option for expressing the complex logic needed for data pipelines, the dbt alternatives we covered above offer a range of features and benefits that cater to different organizational needs. Tools like Matillion, Informatica, and Alteryx provide graphical interfaces for managing ETL processes, while SQLMesh and Dataform offer code-based approaches to SQL- and Python-based data transformation.
For those specifically looking for a dbt Cloud alternative, Datacoves stands out as a tailored, flexible solution designed to integrate seamlessly with modern data workflows, ensuring efficiency and scalability.
dbt is wildly popular and has become a fundamental part of many data stacks. While it’s easy to spin up a project and get things running on a local machine, taking the next step and deploying dbt to production isn’t quite as simple.
In this article we will discuss options for deploying dbt to production, comparing some high, medium, and low effort options so that you can find which works best for your business and team. You might be deploying dbt using one of these patterns already; if you are, hopefully this guide will help highlight some improvements you can make to your existing deployment process.
We're going to assume you know how to run dbt on your own computer (aka your local dbt setup). We’re also going to assume that you either want to or need to run dbt in a “production” environment – a place where other tools and systems make use of the data models that dbt creates in your warehouse.
The deployment process for dbt jobs extends beyond basic scheduling and involves a multifaceted approach. This includes establishing various dbt environments with distinct roles and requirements, ensuring the reliability and scalability of these environments, integrating dbt with other tools in the (EL)T stack, and implementing effective scheduling strategies for dbt tasks. By focusing on these aspects, a comprehensive and robust dbt deployment strategy can be developed. This strategy will not only address current data processing needs but also adapt to future challenges and changes in your data landscape, ensuring long-term success and reliability.
Deploying dbt involves the creation and management of several dbt environments. The development environment is the initial testing ground for creating and refining dbt models. It allows for experimentation without impacting production data. Following this, the testing environment, including stages like UAT and regression testing, rigorously evaluates the models for accuracy and performance. Finally, the production environment is where these models are executed on actual data, demanding high stability and performance.
Reliability and scalability of data models are also important. Ensuring that the data models produce accurate and consistent results is essential for maintaining trust in your data. As your data grows, the dbt deployment should be capable of scaling, handling increased volumes, and maintaining performance.
Integration with other data tools and systems is another key aspect. A seamless integration of dbt with EL tools, data visualization platforms, and data warehouses ensures efficient data flow and processing, making dbt a harmonious component of your broader data stack.
Effective dbt scheduling goes beyond mere time-based scheduling. It involves context-aware execution, such as triggering jobs based on data availability or other external events. Managing dependencies within your data models is critical to ensure that transformations occur in the correct sequence. Additionally, adapting to varying data loads is necessary to scale resources effectively and maintain the efficiency of dbt job executions.
Each deployment option has its place, and the trade-offs between setup cost and long-term maintainability are important to consider when choosing one over another.
Cron jobs are scripts that run at a set schedule. They can be defined in any language. For instance, we can use a simple bash script to run dbt. It’s just like running the CLI commands, but instead of you running them by hand, a computer process would do it for you.
Here’s a simple cron script:
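The original script is not reproduced here, but a minimal sketch might look like the following; the paths, virtual environment, and project layout are placeholders:

```bash
#!/bin/bash
# run_dbt.sh - minimal dbt run script intended to be called from cron
set -e                                     # stop on the first failure

cd /home/analytics/dbt_project             # path to the dbt project (placeholder)
source /home/analytics/venv/bin/activate   # Python environment where dbt is installed

dbt deps            # install package dependencies
dbt run             # build the models
dbt test            # run the tests
```

A crontab entry such as `0 6 * * * /home/analytics/run_dbt.sh >> /home/analytics/dbt_cron.log 2>&1` would then run it daily at 6 AM.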
In order to run on schedule, you’ll need to add this file to your system’s crontab.
As you can tell, this is a very basic dbt run script; we are doing the bare minimum to run the project. There is no consideration for tagged models, tests, alerting, or more advanced checks.
Even though Cron jobs are the most basic way to deploy dbt there is still a learning curve. It requires some technical skills to set up this deployment. Additionally, because of its simplicity, it is pretty limited. If you are thinking of using crons for multi-step deployments, you might want to look elsewhere.
While it's relatively easy to set up a cron job to run on your laptop, this defeats the purpose of using a cron altogether. Crons will only run when the daemon is running, so unless you plan on never turning off your laptop, you’ll want to set up the cron on an EC2 instance (or another server). Now you have infrastructure to support and added complexity to keep in mind when making changes. Running a cron on an EC2 instance is certainly doable, but likely not the best use of resources. Just because it can be done does not mean it should be done. At this point, you’re better off using a different deployment method.
The biggest downside, however, is that your cron script must handle any edge cases or errors gracefully. If it doesn’t, you might wind up with silent failures – a data engineer’s worst enemy.
Cron jobs might serve you well if you have some running servers you can use, have a strong handle on the types of problems your dbt runs and cron executions might run into, and you can get away with a simple deployment with limited dbt steps. It is also a solid choice if you are running a small side-project where missed deployments are probably not a big deal.
Use crons for anything more complex, and you might be setting yourself up for future headaches.
Ease of Use / Implementation – You need to know what you’re doing
Required Technical Ability – Medium/ High
Configurability – High, but with the added complexity of managing more complex code
Customization – High, but with a lot of overhead. Best to keep things very simple
Best for End-to-End Deployment - Low.
Cloud Service Runners like dbt Cloud are probably the most obvious way to deploy your dbt project without writing code for those deployments, but they are not perfect.
dbt Cloud is a product from dbt Labs, the creators of dbt. The platform has some out-of-the-box integrations, such as GitHub Actions and webhooks, but anything more will have to be managed by your team. While there is an IDE (Integrated Development Environment) that allows the user to write new dbt models, you are adding a layer of complexity by orchestrating your deployments in another tool. If you are only orchestrating dbt runs, dbt Cloud is a reasonable choice – it's designed for just that.
However, when you want to orchestrate more than just your dbt runs – for instance, kickoff multiple Extract / Load (EL) processes or trigger other jobs after dbt completes – you will need to look elsewhere.
dbt Cloud will host your project documentation and provide access to its APIs. But that is the lion’s share of the offering. Unless you spring for the Enterprise Tier, you will not be able to do custom deployments or trigger dbt runs based on incoming data with ease.
Deploying your dbt project with dbt Cloud is straightforward, though. And that is its best feature. All deployment commands use native dbt command line syntax, and you can create various "Jobs" through their UI to run specific models at different cadences.
If you are a data team with data pipelines that are not too complex and you are looking to handle dbt deployments without the need for standing up infrastructure or stringing together advanced deployment logic, then dbt Cloud will work for you. If you are interested in more complex triggers to kickoff your dbt runs - for instance, triggering a run immediately after your data is loaded – there are other options which natively support patterns like that. The most important factor is the complexity of the pieces you need to coordinate, not necessarily the size of your team or organization.
Overall, it is a great choice if you’re okay working within its limitations and support a simple workflow. As soon as you reach any scale, however, the cost may be too high.
Ease of Use / Implementation – Very easy
Required Technical Ability – Low
Configurability – Low / Medium
Customization – Low
Best for End-to-End Deployment - Low
The Modern Data Stack is a composite of tools. Unfortunately, many of those tools are disconnected because they specialize in handling one of the steps in the ELT process. Only after working with them do you realize that there are implicit dependencies between these tools. Tools like Datacoves bridge the gaps between the tools in the Modern Data Stack and enable more flexible dbt deployment patterns. Additionally, they cover the End-to-End solution, from Extraction to Visualization, meaning they can handle steps before and after Transformation.
If you are loading your data into Snowflake with Fivetran or Airbyte, your dbt runs need to be coordinated with those EL processes. Often, this is done by manually setting the ETL schedule and then defining your dbt run schedule to coincide with your ETL completion. It is not a hard dependency, though. If you’re processing a spike in data or running a historical re-sync, your ETL pipeline might take significantly longer than usual. Your normal dbt run won’t play nicely with this extended ETL process, and you’ll wind up using Snowflake credits for nothing.
This is a common issue for companies moving from early stage / MVP data warehouses into more advanced patterns. There are ways to connect your EL processes and dbt deployments with code, but Datacoves makes it much easier. Datacoves will trigger the right dbt job immediately after the load is complete. No need to engineer a solution yourself. The value of the Modern Data Stack is being able to mix and match tools that are fit for purpose.
Meeting established data freshness and quality SLAs is challenging enough, but with Datacoves, you’re able to skip building custom solutions for these problems. Every piece of your data stack is integrated and working together. If you are orchestrating with Airflow, then you’re likely running a Docker container which may or may not have added dependencies. That’s one common challenge teams managing their own instances of Airflow will meet, but with Datacoves, container / image management and synchronization between EL and dbt executions are all handled on the platform. The setup and maintenance of the scalable Kubernetes infrastructure necessary to run Airflow is handled entirely by the Datacoves platform, which gives you flexibility but with a lower learning curve. And, it goes without saying that this works across multiple environments like development, UAT, and production.
With the End-to-End Pipeline in mind, one of the convenient features is that Datacoves provides a singular place to access all the tools within your normal analytics workflow - extraction + load, transformation, orchestration, and security controls are in a single place. The implicit dependencies are now codified; it is clear how a change to your dbt model will flow through to the various pieces downstream.
Datacoves is for teams who want to introduce a mature analytics workflow without the weight of adopting and integrating a new suite of tools on their own. This might mean you are a small team at a young company, or an established analytics team at an enterprise looking to simplify and reduce platform complexity and costs.
There are some prerequisites, though. To make use of Datacoves, you do need to write some code, but you’ll likely already be used to writing configuration files and dbt models that Datacoves expects. You won't be starting from scratch because best practices, accelerators, and expertise are already provided.
Ease of Use / Implementation – You can utilize YAML to generate DAGS for a simpler approach, but you also have the option to use Python DAGS for added flexibility and complexity in your pipelines.
Required Technical Ability – Medium
Configurability – High
Customization – High. Datacoves is modular, allowing you to embed the tools you already use
Best for End-to-End Deployment - High. Datacoves takes into account all of the factors of dbt Deployment
What do you use to deploy your dbt project when you have a large, complex set of models and dependencies? An orchestrator like Airflow is a popular choice, with many companies opting to use managed deployments through services such as Astronomer.
For many companies – especially in the enterprise – this is familiar territory. Adoption of these orchestrators is widespread. The tools are stable, but they are not without some downsides.
These orchestrators require a lot of setup and maintenance. If you’re not using a managed service, you’ll need to deploy the orchestrator yourself, and handle the upkeep of the infrastructure running your orchestrator, not to mention manage the code your orchestrator is executing. It’s no small feat, and a large part of the reason that many large engineering groups have dedicated data engineering and infrastructure teams.
Running your dbt deployment through Airflow or any other orchestrator is the most flexible option you can find, though. The increase in flexibility means more overhead in terms of setting up the systems you need to run and maintain this architecture. You might need to get DevOps involved, you’ll need to move your dbt project into a Docker image, you’ll want an airtight CI/CD process, and ultimately have well defined SLAs. This typically requires Docker images, container management, and some DevOps work. There can be a steep learning curve, especially if you’re unfamiliar with what’s needed to take an Airflow instance to a stable production release.
There are three ways to run Airflow: deploying it on your own, using a managed service, or using an integrated platform like Datacoves. When using a managed service or an integrated platform like Datacoves, you need to consider a few factors:
Airflow is a multi-purpose tool. It’s not just for dbt deployments. Many organizations run complex data engineering pipelines with Airflow, and by design, it is flexible. If your use of Airflow extends well beyond dbt deployments or ELT jobs oriented around your data warehouse, you may be better suited for a dedicated managed service.
Similarly, if your organization has numerous teams dedicated to designing, building and maintaining your data infrastructure, you may want to use a dedicated Airflow solution. However, not every organization is able to stand up platform engineering teams or DevOps squads dedicated to the data infrastructure. Regardless of the size of your team, you will need to make sure that your data infrastructure needs do not outmatch your team’s ability to support and maintain that infrastructure.
Every part of the Modern Data Stack relies on other tools performing their jobs; data pipelines, transformations, data models, BI tools - they are all connected. Using Airflow for your dbt deployment adds another link in the dependency chain. Coordinating dbt deployments via Airflow can always be done through writing additional code, but this is an additional overhead you will need to design, implement, and maintain. With this approach, you begin to require strong software engineering and design principles. Your data models are only as useful as your data is fresh; meeting your required SLAs will require significant cross-tool integration and customization.
If you are a small team looking to deploy dbt, there are likely better options. If you are a growing team, there are certainly simpler options with less infrastructure overhead. For Data teams with complex data workflows that combine multiple tools and transformation technologies such as Python, Scala, and dbt, however, Airflow and other orchestrators can be a good choice.
Ease of Use / Implementation – Can be quite challenging starting from scratch
Required Technical Ability – High
Configurability – High
Customization – High, but build time and maintenance costs can be prohibitive
Best for End-to-End Deployment - High, but requires a lot of resources to set up and maintain
Criteria | Cron Job | Cloud Runner | Integrated Platform | Fully Custom Deployment
---|---|---|---|---
Ease of Use/ Implementation | Need to know what you’re doing and only good for simple use cases. | Very easy to start up | Simple using YAML with ability to build more complex pipelines with Python. | Challenging for most |
Required Technical Ability | Medium/ High | Low | Medium | High |
Configurability | High | Low / Medium | High | High |
Customization | High | Low | High | High |
End-to-End Deployment | Low | Low | High | High |
The way you should deploy your dbt project depends on a handful of factors – how much time you’re willing to invest up front, your level of technical expertise, as well as how much configuration and customization you need.
Small teams might have high technical acumen but not enough capacity to manage a deployment on their own. Enterprise teams might have enough resources but maintain disparate, interdependent projects for analytics. Thankfully, there are several options to move your project beyond your local and into a production environment with ease. And while specific tools like Airflow have their own pros and cons, it’s becoming increasingly important to evaluate your data stack vendor solution holistically. Ultimately, there are many ways to deploy dbt to production, and the decision comes down to spending time building a robust deployment pipeline or spending more time focusing on analytics.
Jinja is the game-changing feature of dbt Core that allows us to create dynamic SQL code. In addition to the standard Jinja library, dbt Core includes additional functions and variables to make working with dbt even more powerful out of the box.
See our original post, The Ultimate dbt Jinja Cheat Sheet, to get started with Jinja fundamentals like syntax, variable assignment, looping, and more. Then dive into the information below, which covers the Jinja additions provided by dbt Core.
This cheat sheet references the extra functions, macros, filters, variables, and context methods that are specific to dbt Core.
Enjoy!
These pre-defined dbt Jinja functions are essential to the dbt workflow by allowing you to reference models and sources, write documentation, print, write to the log and more.
Functions | Description / Example
---|---
ref(project_or_package, model_name, version=latest_version) | This function allows you to reference models within or across dbt projects. The version and project_or_package parameters are optional. Example: `{{ ref('my_model') }}`
source(source_name, table_name) | This function allows you to create a dependency between the current model and a source. Example: `{{ source('my_source', 'my_table') }}`
log(msg, info=False) | This function writes a message to the log and/or stdout. To write to both the log file and stdout, set the info arg to True. If set to False, the function will only write to the log file. Example: `{{ log("building model", info=True) }}`
print("string to print") | Use this function when you wish to print messages to the log and stdout. This is the same as log() with info=True.
doc('doc_block_name') | The doc function allows you to reference {% docs %} blocks from markdown (.md) files in the description field of property files (.yml). It is important to note that the doc() function is how you reference the contents of a {% docs %} block. A {% docs %} block can also be used to override preset overviews.
return(data) | Use this function to return data to the caller. This function preserves the data type: dict, list, int, etc.
env_var('env_var_name', 'default_value') | env_var is used to incorporate environment variables anywhere you can use Jinja code. Use the optional second argument to set a default and avoid compilation errors.
var('variable_name') | Use when configuring packages in multiple environments or defining values across multiple models in a package. For variables that rarely change, define them in the dbt_project.yml file. Variables that change frequently can be defined through the CLI.
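As a sketch of several of these functions working together in one model (the source, model, and environment variable names are invented for illustration):

```sql
{{ log("Building stg_orders", info=True) }}

with source_orders as (
    select * from {{ source('jaffle_shop', 'raw_orders') }}
),
customers as (
    select * from {{ ref('stg_customers') }}
)

select
    o.order_id,
    c.customer_id,
    '{{ env_var("DBT_ENV", "dev") }}' as built_in_env
from source_orders o
join customers c on c.customer_id = o.customer_id
```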
These macros are provided in dbt Core to improve the dbt workflow.
Macros | Description
---|---
run_query() | This macro allows you to run SQL queries and store their results. It returns a table object with the query result.
debug() | Call this macro within a macro to open up a Python debugger. The DBT_MACRO_DEBUGGING environment variable must be set. The debug macro is only available in dbt Core/Datacoves. Only use `{{ debug() }}` in a development environment.
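A common pattern, sketched here with an assumed stg_payments model, is to fetch values with run_query inside a macro and guard the result handling with the execute flag:

```sql
{% macro get_payment_methods() %}
    {% set query %}
        select distinct payment_method from {{ ref('stg_payments') }}
    {% endset %}

    {% set results = run_query(query) %}

    {% if execute %}
        {# rows are only available at run time, not while dbt parses the project #}
        {% set methods = results.columns[0].values() %}
    {% else %}
        {% set methods = [] %}
    {% endif %}

    {{ return(methods) }}
{% endmacro %}
```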
These dbt Jinja filters are used to modify data types.
Filters | Description
---|---
as_bool | This Jinja filter will coerce the output into a boolean value. If the conversion can’t happen, it will return an error.
as_number | This Jinja filter will coerce the output into a numeric value. If the conversion can’t happen, it will return an error.
as_native | This Jinja filter will return the native Python type (set, list, tuple, dict, etc.). It is best for iterables; use as_bool or as_number for boolean or numeric values.
These dbt Core "variables" such as config, target, source, and others contain values that provide contextual information about your dbt project and configuration settings. They are typically used for accessing and customizing the behavior of your dbt models based on project-specific and environment-specific information.
Project Context Variables | Description
---|---
adapter | dbt uses the adapter to communicate with the database. It is set up by installing the database-specific adapter package (e.g. Snowflake, Redshift). The adapter has methods that are translated into SQL statements specific to your database.
builtins | The builtins variable is a dictionary containing keys for the dbt context methods ref, source, and config.
config | The config variable allows you to get configurations set for dbt models and to set constraints that ensure certain configurations exist for a given model. config.get('<config_key>', default='value') fetches a configuration named <config_key>, for example "materialized"; if no default is provided and the configuration item is not defined, this returns None. config.require('<config_key>') strictly requires that a key named <config_key> is defined in the configuration and throws an error if it is not set. config.get and config.require are commonly seen in the context of custom materializations; however, they can be used in other macros if those macros are used within a model, seed, or snapshot context and the relevant configurations are set.
dbt_version | The dbt_version variable is helpful for debugging by returning the installed dbt version. This is the dbt version that is running, not what you define in your project; e.g. if you create a project with dbt 1.3 and run it on another machine with dbt 1.6, this will say 1.6.
execute | The execute variable is set to True when dbt runs SQL against the database, such as when executing dbt run. When running dbt commands such as dbt compile, where dbt parses your project but no SQL is run against the database, execute is set to False. This variable is helpful when your Jinja relies on a result from the database; wrap the Jinja in an if statement.
flags | This variable holds the value of the flags passed in the command line, such as FULL_REFRESH, STORE_FAILURES, and WHICH (compile, run, build, run-operation, test, and show). It allows you to set logic based on run modes and run hooks based on the current command/type.
graph | The graph variable is a dictionary that holds information about the nodes (models, sources, tests, snapshots) in your dbt project.
model | This is the graph object of the current model. This object allows you to access the contents of a model, the model structure and JSON schema, config settings, and the path to the model.
modules | This variable contains Python modules for working with data, including datetime, pytz, re, and itertools.
project_name | This variable returns the name of the root-level project that is being run.
target | This variable contains information about your dbt profile target, such as your warehouse connection information. Use dot notation to access more, such as target.name, target.schema, or target.database, etc.
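For instance, target and execute are often combined to vary behavior by environment while keeping parsing safe; the schema and table names below are assumptions:

```sql
{% if target.name == 'prod' %}
    {% set events_schema = 'analytics' %}
{% else %}
    {% set events_schema = 'analytics_dev' %}
{% endif %}

{% if execute %}
    {{ log("Running against " ~ target.database ~ "." ~ target.schema, info=True) }}
{% endif %}

select * from {{ events_schema }}.web_events
```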
These special variables provide information about the current context in which your dbt code is running, such as the model, schema, or project name.
Run Context Variables | Description
---|---
database_schemas | Only available in the on-run-end context. This variable allows you to reference the databases and schemas of the models built during the run. Useful when using multiple databases.
invocation_id | This function outputs a UUID every time you run or compile your dbt project. It is useful for auditing. You may access it in the query-comment, in the info dictionary in events and logs, and in the metadata dictionary in dbt artifacts.
results | Only available in the on-run-end context. This variable contains a list of Result objects and allows access to the information populated in the run results JSON artifact.
run_started_at | This variable outputs a timestamp for the start time of a dbt run and defaults to UTC. It is a Python datetime object, so you can use standard strftime formatting.
schemas | Only available in the on-run-end context. This variable allows you to reference a list of schemas for models built during a dbt run. Useful for granting privileges.
selected_resources | This variable allows you to access a list of the nodes selected by the current dbt command. The items in the list depend on the --select, --exclude, and --selector parameters.
this | {{ this }} is the database representation of the current model. Use dot notation to access more properties, such as {{ this.database }} and {{ this.schema }}.
thread_id | The thread_id is a unique identifier assigned to the Python threads that are actively executing tasks, such as running nodes in dbt. It typically takes the form of names like "Thread-1", "Thread-2", and so on, distinguishing different threads.
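A typical use of `{{ this }}`, sketched here with hypothetical model and column names, is filtering an incremental model against its own existing table:

```sql
{{ config(materialized='incremental') }}

select
    event_id,
    event_time,
    user_id
from {{ ref('stg_events') }}

{% if is_incremental() %}
-- {{ this }} resolves to the database.schema.identifier of this model's existing table
where event_time > (select max(event_time) from {{ this }})
{% endif %}
```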
These methods allow you to retrieve information about models, columns, or other elements of your project.
Context Methods | Description
---|---
set(value, default) | Allows you to use the Python method set(), which converts an iterable into a unique set of values. This is NOT the same as the Jinja set expression, which is used to assign a value to a variable. This method returns None if the value entered is not an iterable. For example, the Python set() method and the Jinja set expression can be combined to remove duplicate elements from a list.
set_strict(value) | Same as the set method above, but it will raise a TypeError if the entered value is not a valid iterable.
exceptions | Used to raise errors and warnings in a dbt run: raise_compiler_error will raise an error and print out the message, and the model will FAIL; warn will raise a compiler warning and print out the message, and the model will still PASS.
fromjson(string, default) | Deserializes a JSON string into a Python dict or list.
fromyaml(string, default) | Deserializes a YAML string into a Python dict or list.
tojson(value, default) | Serializes a Python dict or list to a JSON string.
toyaml(value, default) | Serializes a Python dict or list to a YAML string.
local_md5 | This variable locally creates an MD5 hash of the given string.
zip(*args, default) | This method allows you to combine any number of iterables.
zip_strict(value) | Same as the zip method, but it will raise a TypeError if one of the given values is not a valid iterable.
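A brief sketch of the set and zip methods together (the list contents are arbitrary):

```sql
{% set colors = ['red', 'blue', 'blue'] %}
{% set unique_colors = set(colors) | list %}    {# duplicates removed; order is not guaranteed #}

{% set letters = ['a', 'b'] %}
{% set numbers = [1, 2] %}
{% set pairs = zip(letters, numbers) | list %}  {# [('a', 1), ('b', 2)] #}

{{ log("unique colors: " ~ unique_colors ~ ", pairs: " ~ pairs, info=True) }}
```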
Please contact us with any errors or suggestions.
If you've taken an interest in dbt (data build tool) and are on the fence about whether to opt for dbt Cloud or dbt Core, you're in the right place. Perhaps you're already using one of the dbt platforms and are considering a change. Regardless of your current position, understanding the differences between these options is crucial for making an informed decision. In this article, we'll delve deep into the key distinctions between dbt Cloud and dbt Core.
For those new to the dbt community, navigating the terminology can be a tad confusing. "dbt," "dbt Core," and "dbt Cloud" may sound similar but each represents a different facet of the dbt ecosystem. Let's break it down.
dbt is the generic name for the open-source tool; when people say "dbt," they are mostly referring to the features found in dbt Core. dbt allows users to write, document, and execute SQL-based transformations, making it easier to produce reliable and up-to-date analytics. By facilitating practices like version control, testing, and documentation, dbt enhances the analytics engineering workflow, turning raw data into actionable insights.
Once you decide dbt is right for your organization, the next step is to determine how you'll access dbt. The two most prevalent methods are dbt Core and dbt Cloud. While dbt Cloud offers an enhanced experience with additional features, its abstraction can sometimes limit the desired flexibility and control over the workflow especially when it comes to using dbt with the complexities of an enterprise.
Throughout this article we'll observe that by using dbt Core and incorporating other tools, you can achieve many of the same functionalities as dbt Cloud while maintaining flexibility and control. While this approach offers enhanced flexibility, it consequently introduces increased complexity, maintenance, and an added workload. When adopting a dbt platform it is important to understand the tradeoffs to truly know what will work best for your data team.
dbt Core is an open-source data transformation tool that enables data analysts and engineers to transform and model data to derive business insights. dbt Core is the foundational, open-source version of dbt that provides users with the utmost flexibility. The term "flexible" implies that users have complete autonomy over its implementation, integration, and configuration within their projects.
Even though dbt Core is free, to meet or exceed the functionality of dbt Cloud it will need to be paired with additional tooling, as we will discuss below. These open-source solutions may be leveraged at no cost, but this increases the platform maintenance overhead and may impact the total cost of ownership and the platform's time to market. Alternatively, managed dbt Core platforms exist, like Datacoves, which simplify this process.
Installing and using dbt Core is done manually. Depending on which data warehouse you are using, you select the appropriate dbt adapter, such as dbt-snowflake, dbt-databricks, dbt-redshift, etc. You can see all available dbt adapters on our dbt libraries page. If you are using Snowflake, you can check out our detailed Snowflake with dbt getting started guide.
Given that you have installed the pre-requisites, installing dbt is just a matter of installing dbt-snowflake.
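For example, with Python already available, a typical install might be no more than the following (the virtual environment name is a placeholder):

```bash
python -m venv dbt-env             # create an isolated environment
source dbt-env/bin/activate        # activate it
pip install dbt-snowflake          # installs the Snowflake adapter and dbt Core with it
```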
dbt Cloud is a hosted dbt platform to develop and deploy dbt projects. dbt Cloud leverages all the power of dbt Core with some extra features such as a proprietary Web-based UI, a dbt job scheduler, APIs, integration with Continuous Integrations platforms like Github Actions, and a proprietary Semantic layer. dbt Cloud's features are all intended to facilitate the dbt workflow.
dbt Cloud pricing has three tiers: Enterprise, Team, and Developer. Developer is a free tier meant for a single developer with a hard limit of 3,000 model runs per month. Team plan pricing starts at $100 per developer for teams of up to 8, with 15,000 successful models built per month; any additional models cost $0.01 each.
When it comes to the Integrated Development Environment (IDE), both dbt Cloud and dbt Core present distinct advantages and challenges. Whether you prioritize flexibility, ease of setup, or a blend of both, your choice will influence how your team develops, tests, and schedules your data transformations. Let's explore how each option handles the IDE aspect and the impact on developers and analytic engineers.
In the case of IDEs, using dbt Core requires setting up a development environment on each member's device or in a virtual space like AWS Workspaces. This involves installing a popular IDE such as VS Code, installing dbt Core, connecting to a data warehouse, and handling dependencies like Python versions.
Enterprise dbt setups typically include additional dependencies to enhance productivity. Some notable VS Code extensions for this include dbt Power User, SQLFluff, and the official dbt Snowflake VS Code extension.
When companies are ramping up with dbt, one of the pain points is setting up and managing dbt IDE environments. Analytic Engineers coming to dbt may not be familiar with concepts like version control with git or using the command line. The dbt Cloud IDE simplifies developer onboarding by providing a web-based SQL IDE to team members so they can easily write, test, and refine data transformation code without having to install anything on their computers. Complexities like starting a git branch are tucked behind a big colorful button so users know that is the first step in their new feature journey.
However, Developers who are accustomed to more versatile local IDEs, such as VS Code, may find the dbt Cloud experience limiting as they cannot leverage extensions such as those from the VS Code Marketplace nor can they extend dbt Core using the vast array of Python libraries.
It is possible to get the best of both worlds - the flexibility of dbt Core in VS Code and the quick setup that dbt Cloud offers - with a managed dbt Core platform like Datacoves. In a best-in-class developer setup, new users are onboarded in minutes with simple configuration screens that remove the need to edit text files such as profiles.yml and the complexity of creating and managing SSH keys. Version upgrades of dbt or any dependent library should be transparent to users. Spinning up a pristine environment should be a matter of clicks.
Scheduling in a dbt project is crucial for ensuring timely and consistent data updates. It's the backbone of reliable and up-to-date analytics in a dbt-driven environment.
While an orchestrator does not come out of the box with dbt Core, when setting up a deployment environment companies can leverage any orchestration tool, such as Airflow, Dagster, or Prefect. They can connect steps prior to or after the dbt transformations and they can trigger any tool that exists within or outside the corporate network.
dbt Cloud makes deploying a dbt Core project simple. It allows you to define custom environment variables and the specific dbt commands (seed, run, test) that you want to run during production runs. The dbt Cloud scheduler can be configured to trigger at specific intervals using an intuitive UI.
dbt Cloud is primarily focused on running dbt projects. Therefore, if a data pipeline has more dependencies, an external orchestration tool may be required. Fortunately, if you do use an external orchestrator, dbt Cloud offers an API to trigger dbt Cloud jobs from your orchestrator.
DataOps emphasizes automating the integration of code changes, ensuring that data transformations are consistently robust and reliable. Both platforms approach CI/CD differently. How seamless is the integration? How does each platform handle tool compatibility?
When using dbt Core for your enterprise data platform, you will need to not only define and configure the automation scripts, but you will also need to ensure that all the components, such as a git server, CI server, CI runners, etc. are all working harmoniously together.
Since dbt Core can be run within the corporate firewall, it can be integrated with any CI tool and internal components such as Jira, Bitbucket, and Jenkins. To do this well, all the project dependencies must be packed into reusable Docker containers. Notifications will also need to be defined across the various components and all of this will take time and money.
dbt Cloud has built in CI capabilities which reduce the need for third party tools. dbt Cloud can also be paired with Continuous Integration (CI) tools like GitHub Actions to validate data transformations before they are added to a production environment. Aspects such as code reviews and approvals will occur in the CI/CD tool of choice such as GitHub and dbt Cloud can report job execution status back to GitHub Actions. This allows teams to know when it is safe to merge changes to their code. One item to note is that each successful model run in your CI run will count against the monthly model runs as outlined in the dbt Cloud pricing.
Companies that have tools like Bitbucket, Jira, and Jenkins within their corporate firewall may find it challenging to integrate with dbt Cloud.
A semantic layer helps businesses define important metrics like sales, customer churn, and customer activations with the flexibility to aggregate at run time. These metrics can be referenced by downstream tools as if they had been previously computed. End-users benefit from the flexibility to aggregate metrics at diverse grains without the company incurring the cost of pre-computing every permutation. These on-the-fly pivots ensure consistent and accurate insights across the organization.
dbt Core does not come with a built-in semantic layer, but there are open-source and proprietary alternatives that allow you to achieve the same functionality. These include Cube.dev and Lightdash.
dbt Cloud has been rolling out a proprietary semantic layer which is currently in public preview. This feature is only available to dbt pricing plans Team and Enterprise. When using the dbt Cloud semantic layer your BI tool connects to a dbt Cloud proxy server which sits between the BI tool and your Data Warehouse.
dbt’s semantic layer offers a system where metrics are standardized as dbt metadata, visualized in your DAG, and integrated seamlessly with features like the Metadata API and the dbt proxy server.
Understanding your dbt project's structure and data flow is essential for effective data management and collaboration. While dbt Cloud offers dbt Explorer, a tool that visually maps model dependencies and metadata, it is exclusive to dbt Cloud users.
dbt Docs (dbt docs generate) is a built-in feature in dbt Core that generates a static documentation site, providing lineage graphs and detailed metadata for models, columns, and tests. However, for larger projects, dbt Docs can struggle with high memory usage and slow load times, making it less practical for extensive datasets. dbt Docs also lacks column-level lineage, which is crucial for impact analysis and debugging.
But no worries—dbt Core users can achieve similar, and even better, functionalities through alternative methods. The answer: a data catalog like DataHub. A Data Catalog can significantly enhance not just your dbt exploration, but your entire data project discovery experience!
DataHub offers capabilities such as centralized search and discovery across your data assets, lineage (including column-level lineage), and rich documentation, all in a catalog that scales beyond what dbt Docs can comfortably handle.
There is an obvious caveat. Implementing and maintaining an open-source data catalog like DataHub introduces additional complexity. Organizations need to allocate resources to manage, update, and scale the platform effectively. Fortunately, a managed solution like Datacoves simplifies this by providing an integrated offering that includes DataHub, streamlining deployment and reducing maintenance overhead.
APIs play a crucial role in streamlining dbt operations and enhancing extensibility.
With dbt Core, users often rely on external solutions to integrate specific API functionalities.
Administrative API Alternative: There is currently no feature-for-feature alternative to the dbt Cloud Administrative API. However, the Airflow API can be leveraged to enqueue runs for jobs, which is a primary feature of the dbt Cloud Administrative API.
Discovery API Alternative: This API was formerly known as the dbt Cloud Metadata API. Tools such as Datahub can provide similar functionality. Datahub can consume dbt Core artifacts such as the manifest.json and expose an API for dbt metadata consumption.
Semantic Layer API Alternative: When it comes to establishing and managing the semantic layer, Cube.dev provides a mature, robust, and comprehensive alternative to the dbt Cloud Semantic layer. Cube also has an API tailored for this purpose.
dbt Cloud offers three APIs. These APIs are available to Team and Enterprise customers.
Administrative API: The dbt Cloud Administrative API is designed primarily for tasks like initiating runs from a job, monitoring the progress of these runs, and retrieving artifacts once the jobs have been executed. dbt Cloud is working on additional functionality for this API, such as operational functions within dbt Cloud.
Discovery API: Whenever you run a project in dbt Cloud, it saves details about that project, such as information about your data models, sources, and how they connect. The Discovery API lets you access and understand this saved information. Use cases include: performance, quality, discovery, governance and development.
Semantic Layer API: The dbt Semantic Layer API provides a way for users to interact with their data using a JDBC driver. By using this API, you can easily query metrics values from your data and get insights.
Examining the differences between dbt Core and dbt Cloud reveals that both can lead organizations to similar results. Much of what dbt Cloud offers can be replicated with dbt Core when combined with appropriate additional tools. While this might introduce some complexities, the increased control and flexibility might justify the trade-offs for certain organizations. Thus, when deciding between the two, it's a matter of prioritizing simplicity versus adaptability for the team. This article only covers dbt Core vs dbt Cloud, but you can read more about dbt alternatives in our blog.
As a managed dbt Core solution, the Datacoves platform simplifies the dbt Core experience and retains its inherent flexibility. It effectively bridges the gap, capturing many benefits of dbt Cloud while mitigating the challenges tied to a pure dbt Core setup. See if Datacoves dbt pricing is right for your organization or visit our product page.
The dbt-utils package enhances the dbt experience by offering a suite of utility macros. Designed to tackle common SQL modeling patterns, it streamlines complex operations, allowing users to focus on data transformation rather than the intricacies of SQL. dbt-utils is a must-have tool for dbt aficionados!
The dbt-utils package is a gem in the world of data transformations. Let this cheat sheet guide you swiftly through its features, ensuring you get the most out of dbt-utils. Enjoy!
The SQL generators in the dbt-utils package streamline your modeling tasks. By automating common SQL patterns, they minimize manual coding and guarantee consistent, high-quality queries. Think of it as a handy toolkit for every dbt user's SQL endeavors!
SQL Generators | |
---|---|
date_spine(datepart, start_date, end_date) (view source) |
This macro provides the SQL needed to create a sequence of dates. It starts with the 'start_date' (if it matches the specified date part) but doesn't go up to the 'end_date'. |
deduplicate(relation, partition_by, order_by) (view source) |
This macro produces the SQL needed to delete duplicate rows from a model, source, or CTE. |
haversine_distance(lat1, lon1, lat2, lon2, unit='mi') (view source) |
This macro computes the haversine distance between a pair of x/y coordinates. You can also specify a 'unit' for the distance. |
group_by(n) (view source) |
This macro constructs a "group by" statement for fields ranging from 1 to N. |
star(from, except=[], relation_alias=False, prefix='', suffix='', quote_identifiers=True) (view source) |
This macro creates a list of fields separated by commas from the 'from' argument, but it leaves out any fields mentioned in the 'except' argument. Further customizations are available with the optional arguments. |
union_relations(relations, exclude, include, column_override, source_column_name='_dbt_source_relation', where) (view source) |
This macro merges the items from the relations argument using 'union all'. It can handle when the columns are in different orders or when some columns are missing in certain Relations. Any unique columns will have 'null' values where they don't exist in some relations. Additionally, a new column, '_dbt_source_relation', is added to show where each record originated. |
generate_series(upper_bound) (view source) |
This macro creates a cross-database way to produce a list of numbers up to a specified maximum. It generates a SQL result set where the numbers start from 1 and go up to the number you've chosen. |
generate_surrogate_key(<field list>) (view source) |
This macro creates a cross-database way to produce a hashed surrogate key based on the fields you specify. Note: Older versions of this macro were called 'surrogate_key()' and treated nulls and blank strings identically. If you need to revert to this behavior, for example to maintain backward compatibility, add the surrogate_key_treat_nulls_as_empty_strings variable to your 'dbt_project.yml'. |
safe_add(<field list>) (view source) |
This macro provides a cross-database way to sum up fields that can have null values, based on the fields you indicate. |
safe_divide(numerator, denominator) (view source) |
This macro divides values but gives a null result if the bottom number (denominator) is 0. |
safe_subtract(<field list>) (view source) |
This macro provides a cross-database way to calculate the difference between fields that might have null values, based on the fields you select. |
pivot(column, values, alias=True, agg='sum', cmp='=', prefix='', suffix='', then_value=1, else_value=0, quote_identifiers=True) (view source) |
This macro transforms data by turning row values into column headers. |
unpivot(relation, cast_to='varchar', exclude=none, remove=none, field_name=none, value_name=none, quote_identifiers=False) (view source) |
This macro converts a table from a wide layout to a long layout, similar to the melt function in pandas. It replaces Boolean values with the strings 'true' or 'false'. |
width_bucket(expr, min_value, max_value, num_buckets) (view source) |
This macro creates equal-width histograms by dividing the range into uniform intervals. It then determines the bucket number for a given value after evaluating an expression. The output is either an integer representing the bucket number or null if any input is null. |
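To make the table above concrete, here is a minimal sketch of one of these generators in use. The model name dim_dates is an illustrative assumption, not part of the package.

```sql
-- models/dim_dates.sql (hypothetical model): one row per day for 2023.
-- Note that end_date is exclusive, so 2024-01-01 is not included.
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2023-01-01' as date)",
    end_date="cast('2024-01-01' as date)"
) }}
```

Similarly, `{{ dbt_utils.star(from=ref('stg_orders'), except=['_loaded_at']) }}` (with an assumed stg_orders model and _loaded_at column) compiles to an explicit, comma-separated column list, which keeps wide SELECT statements readable without hand-typing every column.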
Within the dbt-utils package lies a set of generic tests, designed to validate your data effortlessly. These tests ensure consistency and quality, checking for common issues without the need to craft custom validation rules. It's data integrity made simple for dbt users.
Generic Tests | |
---|---|
equal_rowcount(compare_model, group_by_columns) (view source) |
This test compares two models or tables and asserts they have the same number of rows. |
fewer_rows_than(compare_model, group_by_columns) (view source) |
This test asserts that the model being tested has fewer rows than the model or table it is compared to. |
equality(compare_model, compare_columns, group_by_columns) (view source) |
This test compares two models or tables and asserts their equality. You can specify a subset of columns to compare. |
expression_is_true(expression, where) (view source) |
The test ensures that a given SQL expression holds true for every record, which is particularly helpful in maintaining column integrity. For instance, this can be used to confirm results from simple algebraic operations between columns, validate the length of a column, or assess the truth value of a column. Note: Use the where clause to check a subset of records. |
recency(datepart, field, interval, group_by_columns) (view source) |
This test ensures the timestamp column in the given model has data that's newer than a specific date range. |
at_least_one(group_by_columns) (view source) |
This test checks a specified column and asserts it has at least one value. |
not_constant(group_by_columns) (view source) |
This test makes sure a column doesn't have the same value for every row. |
not_empty_string(trim_whitespace=true) (view source) |
This test asserts that none of the values are equal to an empty string. The optional trim_whitespace argument removes whitespace from the column values before the check. |
cardinality_equality(field, to) (view source) |
This test asserts that a specific column has the same number of unique values as another column in a different table. |
not_null_proportion(at_least, at_most=1.0, group_by_columns) (view source) |
This test checks that the proportion of non-null values in a column falls within a specified range. If you don't specify an upper limit, it assumes up to 100% (or 1.0) of the values can be non-null. |
not_accepted_values(values) (view source) |
This test checks that there aren't any rows with the specified values. |
relationships_where(to, field, from_condition, to_condition) (view source) |
This test checks the connection between two models, similar to the basic relationship checks. It also allows for filtering out specific rows, like test entries or recent entries, which might have temporary inconsistencies because of data processing limits. |
mutually_exclusive_ranges(lower_bound_column, upper_bound_column, partition_by=None, gaps='allowed', zero_length_range_allowed=False) (view source) |
Given the ranges defined by 'lower_bound_column' and 'upper_bound_column', this test asserts that the range of each row does not overlap with the ranges of other rows. |
sequential_values(interval=1, datepart=None, group_by_columns) (view source) |
This test checks that a column contains sequential values, either numbers or dates. |
unique_combination_of_columns(combination_of_columns, quote_columns=false) (view source) |
This test ensures that when certain columns are combined, their values are unique. For instance, while a month or a product alone might repeat, the pairing of a month with a product is always distinct. |
accepted_range(min_value, max_value, where, inclusive=true) (view source) |
This test ensures a column's values are within a certain range, set by a minimum and maximum value. You can choose if the range includes or excludes the boundary values. If needed, you can also focus on specific records with a filter. As long as the data type can be compared using > or < signs, you can use this test. |
Grouping in tests | Some tests offer an extra group_by_columns option for more detailed checks. The group_by_columns argument must be passed as a list. Tests that support this include: equal_rowcount, fewer_rows_than, recency, at_least_one, not_constant, sequential_values, not_null_proportion. |
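As a quick illustration of how these tests are applied, the snippet below adds a few dbt-utils tests to a hypothetical fct_orders model in a properties file. The model, column, and comparison names are assumptions for the example.

```yaml
version: 2

models:
  - name: fct_orders                      # hypothetical model
    tests:
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_orders')
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_month
            - product_id
    columns:
      - name: order_total
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: true
```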
The introspective macros within the dbt-utils package are a window into your data's metadata. They empower you to dynamically interact with and understand the underlying structure of your datasets. It's like having a magnifying glass for the intricacies of your dbt projects!
Introspective Macros | |
---|---|
get_column_values(table, column, where=none, order_by='count(*) desc', max_records=none, default=[]) (view source) |
This macro produces an array containing the distinct values of a column from a specified relation. Note: Setting order_by='my_column' will sort alphabetically, while order_by='min(created_at)' will sort by when the value was first observed. |
get_filtered_columns_in_relation(from, except=[]) (view source) |
This macro provides a Jinja list, containing columns from a specific relation. It's meant for actual relations and not for CTEs. |
get_relations_by_pattern(schema_pattern, table_pattern, exclude, database=target.database) (view source) |
This macro produces a list of Relations that align with a specified schema or table name pattern. It pairs well with the 'union_relations' macro. |
get_relations_by_prefix(schema, prefix, exclude, database=target.database) (view source) |
DEPRECATED SOON: Use 'get_relations_by_pattern' instead. This macro gives a list of Relations matching a specific prefix and has an option to exclude certain patterns. It works especially well when combined with 'union_relations'. |
get_query_results_as_dict(<sql query>) (view source) |
This macro provides a dictionary based on a SQL query, eliminating the need to work with the Agate library to process the outcome. |
get_single_value(<sql query>) (view source) |
This macro fetches a single value from a SQL query, allowing you to bypass the use of the Agate library when handling the result. |
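A common pattern is to pair an introspective macro with Jinja to generate SQL dynamically. The sketch below uses get_column_values to pivot payment amounts by payment method; the stg_payments model and its columns are assumptions for illustration. Because the macro queries the warehouse at compile time, it must point at a real relation, not a CTE.

```sql
-- Hypothetical model: pivot payment amounts by payment method.
{% set payment_methods = dbt_utils.get_column_values(
    table=ref('stg_payments'),
    column='payment_method'
) %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{{ "," if not loop.last }}
    {% endfor %}
from {{ ref('stg_payments') }}
group by 1
```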
dbt, also known as data build tool, is a data transformation framework that leverages templated SQL to transform and test data. dbt is part of the modern data stack and helps practitioners apply software development best practices to data pipelines. Some of these best practices include code modularity, version control, and continuous testing via its built-in data quality framework. In this article, we will focus on how data can be tested with dbt via built-in functionality and with additional dbt packages and libraries.
Adding tests to workflows does more than ensure code and data integrity; it facilitates a continuous dialogue with your data, enhancing understanding and responsiveness. By integrating testing into your regular workflows, you can:
By embedding testing into the development cycle and consuming the results diligently, teams not only safeguard the functionality of their data transformations but also enhance their overall data literacy and operational efficiency. This proactive approach to testing ensures that the insights derived from data are both accurate and actionable.
In dbt, there are two main categories of tests: data tests and unit tests.
Data tests are meant to be executed with every pipeline run to validate the integrity of the data and can be further divided into two types: Generic tests and Singular tests.
Regardless of the type of data test, the process is the same behind the scenes: dbt will compile the code to a SQL SELECT statement and execute it against your database. If any rows are returned by the query, this indicates a failure to dbt.
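As a rough sketch, a not_null test on a customer_id column is equivalent to a query along these lines (the exact compiled SQL varies by dbt version and adapter, and the orders model is an assumed example):

```sql
-- Rows returned here are failures; dbt counts them and fails the
-- test when the count is greater than zero.
select customer_id
from {{ ref('orders') }}
where customer_id is null
```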
Unit tests, on the other hand, are meant to validate your transformation logic. They rely on predefined data for comparison to ensure your logic is returning an expected result. Unlike data tests, which are meant to be run with every pipeline execution, unit tests are typically run during the CI (Continuous Integration) step when new code is introduced. Unit tests were incorporated in dbt Core as of version 1.8.
These are foundational tests provided by dbt-core, focusing on basic schema validation and source freshness. These tests are ideal for ensuring that your data sources remain valid and up-to-date.
dbt-core provides four built-in generic tests that are essential for data modeling and ensuring data integrity:
unique: verifies that every value in a column (e.g. customer_id) is unique. This is useful for finding records that may inadvertently be duplicated in your data.
not_null: checks that the values for a given column are always present. This can help you find cases where data in a column suddenly arrives without being populated.
accepted_values: validates that every value in a column belongs to a defined set. For example, a column called payment_status may contain values like pending, failed, accepted, and rejected. This test verifies that each row within the column contains one of these payment statuses and no others. This is useful for detecting changes in the data, such as a value being renamed, for example accepted being replaced with approved.
relationships: these tests check referential integrity. This type of test is useful when you have related columns (e.g. the customer identifier) in two different tables. One table serves as the “parent” and the other is the “child” table. This is common when one table has a transaction and only lists a customer_id and the other table has the details for that customer. With this test we can verify that every row in the transaction table has a corresponding record in the dimension/details table. For example, if you have orders for customer_ids 1, 2, 3 we can validate that we have information about each of these customers in the customer details table.
Using a generic test is done by adding it to the model's property (yml) file.
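For example, a properties file applying the four built-in generic tests might look like the following; the orders and customers models and the column names are illustrative:

```yaml
version: 2

models:
  - name: orders                          # hypothetical model
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: payment_status
        tests:
          - accepted_values:
              values: ['pending', 'failed', 'accepted', 'rejected']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```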
Generic tests can accept additional test configurations, such as a where clause to apply the test to a subset of rows. This can be useful on large tables by limiting the test to recent data or excluding rows based on the value of another column. Since an error will stop a dbt build or dbt test of the project, it is also possible to assign a severity to a test and, optionally, a threshold at which failures are treated as warnings instead of errors. Finally, since dbt will automatically generate a name for the test, it may be useful to override the auto-generated test name for simplicity. Here's the same property file from above with the additional configurations defined.
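A hedged sketch of what that might look like is shown below; the filter, thresholds, and custom name are illustrative values, not recommendations:

```yaml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique:
              name: unique_order_id_recent_orders   # overrides the auto-generated name
              config:
                where: "order_date >= current_date - 30"
                severity: error
                warn_if: ">10"     # more than 10 failing rows -> warning
                error_if: ">100"   # more than 100 failing rows -> error
          - not_null:
              config:
                severity: warn
```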
Singular tests allow for the customization of testing parameters to create tailored tests when the default generic ones (or the ones in the packages discussed below) do not meet your needs. These tests are simple SQL queries that express assertions about your data. An example of this type of test can be a more complex assertion such as having sales for one product be within +/- 10% of another product. The SQL simply needs to return the rows that do not meet this condition.
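A singular test for the +/- 10% example above might look roughly like this; the fct_sales model, product identifiers, and tolerance are assumptions for illustration. The test passes when the query returns zero rows.

```sql
-- tests/assert_product_a_sales_close_to_product_b.sql (hypothetical)
with sales as (

    select
        sum(case when product_id = 'A' then amount else 0 end) as product_a_sales,
        sum(case when product_id = 'B' then amount else 0 end) as product_b_sales
    from {{ ref('fct_sales') }}

)

-- Failing rows: product A sales fall outside +/- 10% of product B sales
select *
from sales
where product_a_sales not between product_b_sales * 0.9 and product_b_sales * 1.1
```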
In dbt, it is also possible to define your own custom generic tests. This may be useful when you find yourself creating similar Singular tests. A custom generic test is essentially the same as a dbt macro which has at least a model as a parameter, and optionally column_name if the test will apply to a column. Once the generic test is defined, it can be applied many times just like the generic tests shipped with dbt Core. It is also possible to pass additional parameters to a custom generic test.
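For instance, a minimal custom generic test that asserts a column contains no negative values could be defined as follows; the test name is a made-up example:

```sql
-- tests/generic/is_positive.sql (hypothetical)
{% test is_positive(model, column_name) %}

    -- failing rows: any value below zero
    select {{ column_name }}
    from {{ model }}
    where {{ column_name }} < 0

{% endtest %}
```

Once defined, it can be listed under a column's tests in a properties file just like unique or not_null, e.g. `- is_positive`.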
As our data transformations become more complex, the need for testing becomes increasingly important. The concept of unit testing is already well established in software development, where tests confirm that individual units of code work as intended. Recognizing this, dbt 1.8 introduced unit testing.
Unlike the data tests we have above, which ensure that incoming data meets specific criteria and are run at every data refresh, unit tests are designed to verify that the transformation logic itself produces the expected results. In the context of dbt, unit tests validate transformation logic by comparing the test results against predefined data typically defined using seeds (CSV files) or SQL queries. Unit tests should only be executed when new data transformation code is introduced and implemented since they are designed to help catch potential issues early in the development process. It is recommended to run unit tests only during the CI step. Running them in production would be a redundant use of compute resources because the expected outcomes do not change. Unit testing is only available in 1.8 or higher, but there are community packages (dbt-unit-testing, dbt_datamocktool, dbt-unittest) that have worked to solve this problem and are worth exploring if you are not using dbt 1.8.
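A minimal dbt 1.8 unit test definition looks roughly like the YAML below; the model, input, and row values are assumptions chosen only to show the shape of the syntax:

```yaml
unit_tests:
  - name: test_order_totals_are_summed     # hypothetical test name
    model: fct_orders                       # hypothetical model under test
    given:
      - input: ref('stg_order_items')
        rows:
          - {order_id: 1, amount: 10}
          - {order_id: 1, amount: 15}
    expect:
      rows:
        - {order_id: 1, order_total: 25}
```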
While not technically a dbt test, a freshness check validates the timeliness of source data. The freshness check in dbt Core is designed to monitor the timeliness of the data. It helps ensure that the data loaded into your warehouse is updated regularly and remains relevant for decision-making processes. This is valuable because sometimes data will stop getting refreshed and the data pipelines will continue to run with a silent failure. To assure that you are alerted when a data delivery SLA is not met, simply add a freshness check to your sources.
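A freshness check is configured on the source definition and evaluated with dbt source freshness. A sketch with assumed source and table names is shown below:

```yaml
version: 2

sources:
  - name: raw_app                          # hypothetical source
    database: raw
    loaded_at_field: _loaded_at            # timestamp column used to measure freshness
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```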
This comprehensive suite of testing capabilities in dbt Core ensures that data teams can build, maintain, and verify the reliability and accuracy of their data models effectively.
In addition to the generic tests that can be found within dbt Core, there are a lot more in the dbt ecosystem. These tests are found in dbt packages. Packages are libraries of reusable SQL code created by organizations of the dbt community. We will briefly go over some of the tests that can be found in these packages.
The dbt-utils package, created by dbt Labs, contains generic dbt tests, SQL generators, and macros. The dbt-utils package includes 16 generic tests, including:
not_accepted_values: this test is the opposite of the accepted_values test and is used to check that specific values are NOT present in a column.
equal_rowcount: this test checks that two different tables have the same number of rows. This is a useful test that can assure that a transformation step does not accidentally introduce additional rows in the target table.
fewer_rows_than: this test is used to verify that a target table contains fewer rows than a source table. For example, if you are aggregating a table, you expect that the target table will have fewer rows than the table you are aggregating. This test can help you validate this condition.
There are 16 generic dbt tests available in the dbt-utils package.
Another awesome package that can accelerate your data testing is dbt-expectations. This package is a port of the popular Python library Great Expectations. For those not familiar, Great Expectations is an open-source Python library that is used for automated testing. dbt-expectations is modeled after this library and was developed by Calogica so dbt practitioners would have access to an additional set of pre-created Generic tests without adding another tool to the data platform. Tests in dbt-expectations are divided into seven categories encompassing a total of 62 generic dbt tests.
You can find detailed information on all the dbt-expectations generic tests in their documentation.
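Usage mirrors the built-in generic tests; for example, the snippet below applies two dbt-expectations tests to assumed columns (the model, columns, bounds, and regex are illustrative):

```yaml
version: 2

models:
  - name: fct_orders                       # hypothetical model
    columns:
      - name: order_total
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
      - name: email
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: ".+@.+\\..+"
```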
Created by Snowflake, dbt_constraints adds primary and foreign key constraints to dbt models. When incorporated into a dbt project, this package automatically creates unique keys for all existing unique and dbt_utils.unique_combination_of_columns tests, along with foreign keys for existing relationship tests and not null constraints for not_null tests. It provides three flexible tests - primary_key, unique_key, and foreign_key - which can be used inline, out-of-line, and support multiple columns.
The elementary tool offers 10 generic dbt tests that help in detecting schema changes, validating JSON schemas, and monitoring anomalies in source freshness, among other functionalities.
dbt-fhir focuses on the healthcare sector, providing 20 generic dbt tests for validating HL7® FHIR® (Fast Healthcare Interoperability Resources) data types, a standard for exchanging healthcare information across different systems.
Maintained by Google, the fhir-dbt-analytics package includes tests that ensure the quality of clinical data. These tests might involve counting the number of FHIR resources to verify expected counts or checking references between FHIR resources.
By leveraging these diverse dbt testing packages, data teams can significantly enhance their data validation processes, ensuring that their data pipelines are robust, accurate, and reliable.
While the tests above run against production data and are run even when none of the dbt code has changed, there are some tests that should be applied during development. This will improve a project's long term maintainability, assure project governance, and validate transformation logic in isolation of production data.
The dbt-meta-testing package contains macros to assert test and documentation coverage, leveraging a configuration defined in dbt_project.yml.
While dbt tests are great for testing with "real" data, sometimes you may want to test the logic of a transformation with "fake" data. This type of test is called a unit test. The dbt-unit-testing package has all you need to do proper dbt unit testing. (Side note: unit testing became part of dbt Core in version 1.8, although it is not done exactly as in this package.)
dbt_datamocktool can be used to create mock CSV seeds to stand in for the sources and refs that your models use and test that the model produces the expected output as compared with another CSV seed.
dbt-unittest is a dbt package that enhances dbt package development by providing unit testing macros.
Incorporating automated data validation into CI/CD pipelines helps catch issues early and ensures data accuracy before deployment. By integrating tests into every code change, teams can prevent bad data from reaching production and maintain reliable data pipelines.
dbt-checkpoint is a library that can be leveraged during the development and release life-cycle to assure a level of governance of the dbt project. Typical validations include assuring that dbt models and/or their columns have descriptions and that all the columns in a dbt model (sql) are present in a property file (yml).
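dbt-checkpoint is typically wired up through pre-commit. The sketch below shows the general shape of that configuration; the rev pin is an assumed placeholder and the hook ids should be confirmed against the dbt-checkpoint documentation.

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v1.2.0                            # assumed version; pin to a real release
    hooks:
      - id: check-model-has-description    # hook ids assumed; verify in the docs
      - id: check-model-columns-have-desc
```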
Recce is an open-source data validation toolkit for comprehensive PR review in dbt projects. Recce helps to validate the data impact of code changes during development and PR review by enabling you to compare data structure, profiling statistics, and queries between two dbt environments, such as dev and prod. By performing Recce checks, you are able to identify unexpected data impact, validate expected impact, and prevent bad merges and incorrect data entering production.
Recce checks can be performed during development, automatically as part of CI, and as part of PR review for root cause analysis. The suite of tools in Recce enables you to perform a range of checks.
Record the results of your data validations in the Checklist and share as part of PR review or discussion with stakeholders.
For full coverage, use Recce’s automated ‘preset checks’ that are triggered with each pull request and automatically post an impact summary to your PR comment.
Recce Cloud users can also take advantage of check-syncing and PR merge-blocking until the reviewer or stakeholders have approved the check results.
By default, dbt will not store the results of a dbt test execution. There is a configuration that can be set for the dbt project or at the specific model level which will have dbt store the failures of the test in a table in the data warehouse. While this is a good start, these test results get overwritten each time dbt tests are run. To overcome this deficiency, tools have been developed in the community that store results longitudinally and even provide dashboards of test results.
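Enabling stored failures is a one-line configuration. A minimal sketch at the project level is shown below; it can also be set per model or per test, and newer dbt versions accept data_tests as the config key:

```yaml
# dbt_project.yml
tests:
  +store_failures: true
  +schema: test_failures    # optional: assumed custom schema for the failure tables
```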
Elementary is an open source data observability tool for dbt. It simplifies the capture of dbt test results over time, enables testing without having to manually add tests to all your dbt model columns, and has a user interface for viewing test results as well as dbt lineage.
Elementary also provides advanced configurations for generating Slack alerts for dbt tests, enhancing how teams monitor and respond to data quality issues. You can configure alerts based on test results, test statuses, and test durations. Additionally, you can set up recurring alerts based on a schedule that you define, ensuring continuous oversight without constant manual checking.
Key features include:
This comprehensive suite of tools not only sends notifications but also allows for significant customization, ensuring that alerts are meaningful and actionable. The integration of these features into your workflow facilitates better data management and quicker response to potential data discrepancies, streamlining your project's efficiency and reliability.
The dbt Data Quality package is a Snowflake-only package that helps users access and report on the outputs from dbt source freshness and dbt test results.
The dbt-tools package makes it simple to store and visualize dbt test results in a BI dashboard.
re_data is an open-source data reliability framework for modern data stack.
When migrating data from one system to another, validating that tables match is incredibly important. For this we recommend datacompy to get the job done.
Getting started with dbt testing is simple thanks to the predefined generic dbt tests found within dbt Core and the additional generic tests found in dbt-utils and dbt-expectations. In addition to these juggernauts of the dbt community, other organizations in the dbt community have contributed additional generic tests, tools to improve dbt development, libraries that can help with validation and governance before releasing code to production, and tools that can improve data quality observability. If you are using dbt Cloud or dbt Core, you may be interested in reading more about dbt alternatives such as Datacoves, which falls under managed dbt Core solutions.
You now know what dbt (data build tool) is all about. You are being productive, but you forgot what `dbt build` does or you forgot what the @ dbt graph operator does. This handy dbt cheat sheet has it all in one place.
With the advent of dbt 1.6, we updated the awesome dbt cheat sheet created originally by Bruno de Lima.
We have also moved the dbt Jinja cheat sheet to a dedicated post.
This reference summarizes all the dbt commands you may need as you run your dbt jobs or study for your dbt certification.
If you ever wanted to know what the difference between +model and @model is in your dbt run, you will find the answer. Whether you are trying to understand dbt graph operators or what the dbt retry command does, this cheat sheet has you covered. Check it out below.
These are the principal commands you will use most frequently with dbt. Not all of these will be available on dbt Cloud.
dbt development commands | |
---|---|
dbt build | This command will load seeds, perform snapshots, run models, and execute tests |
dbt compile | Generates executable SQL code of dbt models, analysis, and tests and outputs to the target folder |
dbt docs | Generates and serves documentation for the dbt project (dbt docs generate, dbt docs serve) |
dbt retry | Re-executes the last dbt command from the node point of failure. It references run_results.json to determine where to start |
dbt run | Executes compiled SQL for the models in a dbt project against the target database |
dbt run-operation | Invokes a dbt macro from the command line. Typically used to run some arbitrary SQL against a database. |
dbt seed | Loads CSV files located in the seeds folder into the target database |
dbt show | Executes a SQL query against the target database and, without materializing the results, displays them in the terminal |
dbt snapshot | Executes "snapshot" jobs defined in the snapshot folder of the dbt project |
dbt source | Provides tools for working with source data to validate that sources are "fresh" |
dbt test | Executes singular and generic tests defined on models, sources, snapshots, and seeds |
The dbt commands above have options that allow you to select and exclude models as well as deferring to another environment like production instead of building dependent models for a given run. This table shows which options are available for each dbt command.
dbt command arguments | |
---|---|
dbt build | --select / -s, --exclude, --selector, --resource-type, --defer, --empty, --full-refresh |
dbt compile | --select / -s, --exclude, --selector, --inline |
dbt docs generate | --select / -s, --no-compile, --empty-catalog |
dbt docs serve | --port |
dbt ls / dbt list | --select / -s, --exclude, --selector, --output, --output-keys, --resource-type |
dbt run | --select / -s, --exclude, --selector, --resource-type, --defer, --empty, --full-refresh |
dbt seed | --select / -s, --exclude, --selector |
dbt show | --select / -s, --inline, --limit |
dbt snapshot | --select / -s, --exclude, --selector |
dbt source freshness | --select / -s, --exclude, --selector |
dbt source | --select / -s, --exclude, --selector, --output |
dbt test | --select / -s, --exclude, --selector, --defer |
By combining the arguments above, like "-s", with the options below, you can tell dbt which items you want to select or exclude. This can be a specific dbt model, everything in a specific folder, or, with the latest versions of dbt, the specific version of a model you are interested in.
dbt node selectors | |
---|---|
tag | Select models that match a specified tag |
source | Select models that select from a specified source |
path | Select models/sources defined at or under a specific path. |
file / fqn | Used to select a model by its filename, including the file extension (.sql). |
package | Select models defined within the root project or an installed dbt package. |
config | Select models that match a specified node config. |
test_type | Select tests based on their type: singular or generic (or, in dbt 1.8+, data or unit) |
test_name | Select tests based on the name of the generic test that defines it. |
state | Select nodes by comparing them against a previous version of the same project, which is represented by a manifest. The file path of the comparison manifest must be specified via the --state flag or DBT_STATE environment variable. |
exposure | Select parent resources of a specified exposure. |
metric | Select parent resources of a specified metric. |
result | The result method is related to the state method described above and can be used to select resources based on their result status from a prior run. |
source_status | Select resources based on their source freshness status |
group | Select models defined within a group |
access | Selects models based on their access property. |
version | Selects versioned models based on their version identifier and latest version. |
dbt graph operators provide a powerful syntax that allows you to home in on the specific items you want dbt to process (see the example commands after this table).
dbt graph operators | |
---|---|
+ | If "plus" (+) operator is placed at the front of the model selector, + will select all parents of the selected model. If placed at the end of the string, + will select all children of the selected model. |
n+ | With the n-plus (n+) operator you can adjust the behavior of the + operator by quantifying the number of edges to step through. |
@ | The "at" (@) operator is similar to +, but will also include the parents of the children of the selected model. |
* | The "star" (*) operator matches all models within a package or directory. |
The following commands are used less frequently and perform actions like initializing a dbt project, installing dependencies, or validating that you can connect to your database.
project level dbt commands | |
---|---|
dbt clean | By default, this command deletes contents of the dbt_packages and target folders in the dbt project |
dbt clone | In databases that support it, can clone nodes (views/tables) to the current dbt target database, otherwise it creates a view pointing to the other environment |
dbt debug | Validates dbt project setup and tests connection to the database defined in profiles.yml |
dbt deps | Installs dbt package dependencies for the project as defined in packages.yml |
dbt init | Initializes a new dbt project and sets up the user's profiles.yml database connection |
dbt ls / dbt list | Lists resources defined in a dbt project such as models, tests, and sources |
dbt parse | Parses and validates dbt files. It will fail if there are jinja and yaml errors in the project. It also outputs detailed timing info that may be useful when optimizing large projects |
dbt rpc | DEPRECATED after dbt 1.4. Runs an RPC server that compiles dbt models into SQL that can be submitted to a database by external tools |
The flags below immediately follow the dbt command and go before the subcommand, e.g. dbt <FLAG> run
Read the official dbt documentation
dbt command line (CLI) flags (general) | |
---|---|
-x, --fail-fast / --no-fail-fast | Stop dbt execution as soon as a failure occurs. |
-h, --help | Shows command help documentation |
--send-anonymous-usage-stats / --no-send-anonymous-usage-stats | Send anonymous dbt usage statistics to dbt Labs. |
-V, -v, --version | Returns information about the installed dbt version |
--version-check / --no-version-check | Ensures or ignores that the installed dbt version matches the require-dbt-version specified in the dbt_project.yml file. |
--warn-error | If dbt would normally warn, instead raise an exception. |
--warn-error-options WARN_ERROR_OPTIONS | Allows for granular control over exactly which types of warnings are treated as errors. This argument receives a YAML string like '{"include": "all"}'. |
--write-json / --no-write-json | Whether or not to write the manifest.json and run_results.json files to the target directory |
dbt CLI flags (logging and debugging) | |
---|---|
-d, --debug / --no-debug | Display debug logging during dbt execution, which is useful for debugging and making bug reports. Not to be confused with the dbt debug command, which tests the database connection. |
--log-cache-events / --no-log-cache-events | Enable verbose logging for relational cache events to help when debugging. |
--log-format [text|debug|json|default] | Specify the format of logging to the console and the log file. |
--log-format-file [text|debug|json|default] | Specify the format of logging to the log file by overriding the default format |
--log-level [debug|info|warn|error|none] | Specify the severity of events that are logged to the console and the log file. |
--log-level-file [debug|info|warn|error|none] | Specify the severity of events that are logged to the log file by overriding the default log level |
--log-path PATH | Configure the 'log-path'. Overrides 'DBT_LOG_PATH' if it is set. |
--print / --no-print | Outputs or hides all {{ print() }} statements within a macro call. |
--printer-width INTEGER | Sets the number of characters for terminal output |
-q, --quiet / --no-quiet | Suppress all non-error logging to stdout. Does not affect {{ print() }} macro calls. |
--use-colors / --no-use-colors | Specify whether log output is colorized in the terminal |
--use-colors-file / --no-use-colors-file | Specify whether log file output is colorized |
dbt CLI flags (parsing and performance) | |
---|---|
--cache-selected-only / --no-cache-selected-only | Have dbt cache or not cache metadata about all the objects in all the schemas where it might materialize resources |
--partial-parse / --no-partial-parse | Uses or ignores the pickle file in the target folder used to speed up dbt invocations by only reading and parsing modified objects. |
--populate-cache / --no-populate-cache | At start of run, use `show` or `information_schema` queries to populate a relational cache to speed up subsequent materializations |
-r, --record-timing-info PATH | Saves performance profiling information to a file that can be visualized with snakeviz to understand the performance of a dbt invocation |
--static-parser / --no-static-parser | Use or disable the static parser. (e.g. no partial parsing if enabled) |
--use-experimental-parser / --no-use-experimental-parser | Enable experimental parsing features. |
As a managed dbt Core solution, the Datacoves platform simplifies the dbt Core experience and retains its inherent flexibility. It effectively bridges the gap, capturing many benefits of dbt Cloud while mitigating the challenges tied to a pure dbt Core setup. See if Datacoves dbt pricing is right for your organization or visit our product page.