An Overview of Testing Options for dbt (data build tool)
NOTE: This article was updated in August of 2023 with the most up to date information available.
dbt, also known as data build tool, is a data transformation tool that leverages templated SQL to transform and test data. It was developed by dbt Labs, formerly known as Fishtown Analytics. Data build tool is part of the modern data stack and helps practitioners apply software development best practices to data pipelines. Some of these best practices include code modularity, version control, and continuous testing via its built in data quality framework. In this article we will focus on how data can be tested with dbt via build in functionality and with additional dbt packages and libraries. In future articles we will dive deeper into each of these areas.
dbt tests in dbt Core
In dbt there are two categories of tests: Generic tests and Singular tests. The difference between the two are that Generic tests are reusable functions(dbt macros) for which you do not have to write the SQL query every time while Singular tests are custom tests you write when a generic test does not cover your use case. Regardless of the type of test, behind the scenes the process is the same, dbt will compile code to a SQL SELECT statement and execute it against your database. If any rows are returned by the query, this indicates a failure to dbt.
unique: is a test to verify that every value in a column (e.g. customer_id) contains unique values. This is useful for finding records that may inadvertently be duplicated in your data.
not_null: is a test to check that the values for a given column are always present. This can help you find cases where data in a column suddenly arrives without being populated.
accepted_values: this test is used to validate whether a set of values within a column is present. For example, in a column called payment_status, there can be values like pending, failed, accepted, rejected, etc. This test is used to verify that each row within the column contains one of the different payment statuses, but no other. This is useful to detect changes in the data like when a value gets changed such as accepted being replaced with approved.
relationships: these tests check referential integrity. This type of test is useful when you have related columns (e.g. the customer identifier) in two different tables. One table serves as the “parent” and the other is the “child” table. This is common when one table has a transaction and only lists a customer_id and the other table has the details for that customer. With this test we can verify that every row in the transaction table has a corresponding record in the dimension/details table. For example, if you have orders for customer_ids 1, 2, 3 we can validate that we have information about each of these customers in the customer details table.
Using a generic test is done by adding it to the model's property (yml) file.
Generic tests can accept additional test configurations such as a where clause to apply the test on a subset of rows. This can be useful on large tables by limiting the test to recent data or excluding rows based on the value of another column. Since an error will stop a dbt build or dbt test of the project, it is also possible to assign a severity to a test and optionally a threshold where errors will be treated as warning instead of errors. Finally, since dbt will automatically generate a name for the test, it may be useful to override the auto generated test name for simplicity. Here's the same property file from above with the additional configurations defined.
dbt Core Singular tests
Singular tests allow you to create tests when the default generic ones (or the ones in the packages discussed below) do not meet your needs. These tests are simple SQL queries that express assertions about your data. An example of this type of test can be a more complex assertion such as having sales for one product be within +/- 10% of another product. The SQL simply needs to return the rows that do not meet this condition.
dbt Core Custom Generic tests
In dbt, it is also possible to define your own custom generic tests. This may be useful when you find yourself creating similar Singular tests. A custom generic test is essentially the same as a dbt macro which has a least a model as a parameter, and optionally column_name, if the test will apply to a column. Once the generic test is defined, it can be applied many times just like the generic tests shipped with dbt Core. It is also possible to pass additional parameters to a custom generic test.
dbt Core Freshness Check
While not technically a dbt test, a freshness check validates the timeliness of source data. This is valuable because sometimes data will stop getting refreshed and the data pipelines will continue to run with a silent failure. To assure that that you are alerted when a data delivery SLA is not met, simply add a freshness check to your sources.
dbt Tests in dbt packages
In addition to the generic tests that can be found within dbt Core, there are a lot more in the dbt ecosystem. These tests are found in dbt packages. Packages are libraries of reusable SQL code created by organizations of the dbt community. We will briefly go over some of the tests that can be found in these packages.
dbt-utils generic dbt tests
The dbt-utils package, created by dbt Labs, contains generic dbt tests, SQL generators, and macros. The dbt_utils package include 16 generic tests including:
not_accepted_values: this test is the opposite of the accepted_values test and is used to check that specific values are NOT present in a particular range of rows.
equal_rowcount: this test checks that two different tables have the same number of rows. This is a useful test that can assure that a transformation step does not accidentally introduce additional rows in the target table.
fewer_rows_than: this test is used to verify that a target table contains fewer rows than a source table. For example, if you are aggregating a table, you expect that the target table will have fewer rows than the table you are aggregating. This test can help you validate this condition.
There are 12 additional generic dbt tests available in the dbt-utils package which can be found here.
dbt-expectations generic dbt tests
Another awesome package that can accelerate your data testing is dbt-expectations. This package is a port of the great Python library Great Expectations. For those not familiar, Great Expectations is an open-source Python library that is used for automated testing. dbt-expectations is modeled after this library and was developed by Calogicaso dbt practitioners would have access to an additional set of pre-created Generic tests without adding another tool to the data platform. Tests in dbt-expectations are divided into seven categories encompassing a total of 62 generic dbt tests:
Table shape (15 generic dbt tests)
Missing values, unique values, and types (6 generic dbt tests)
Sets and ranges (5 generic dbt tests)
String matching (10 generic dbt tests)
Aggregate functions(17 generic dbt tests)
Multi-column (6 generic dbt tests)
Distributional functions (3 generic dbt tests)
You can find detailed information on all the dbt-expectations generics tests in their documentation.
The dbt_constraints was created by Snowflake to add primary and foreign key constraints to dbt models. When added to a dbt project, this package will automatically create unique keys for all your existing unique and dbt_utils.unique_combination_of_columns tests, foreign keys for existing relationship tests, and not null constraints for not_null tests. The package also provides three tests (primary_key, unique_key, and foreign_key) that are a bit more flexible than the standard dbt tests. These tests can be used inline, out-of-line, and can support multiple columns when used in the tests: section of a model. The primary_key test will also cause a not null constraint to be created on each column.
elementary dbt-data-reliabilit generic dbt tests
The elementary tool (more info below) has 10 generic dbt tests that can be leveraged to look for schema changes, validate JSON schemas and look for anomalies in source freshness among other things..
dbt-fihr generic dbt tests
The HL7® FHIR® (Fast Healthcare Interoperability Resources) is a standard that defines how healthcare information can be exchanged between different systems. This package contains 20 generic dbt tests to validate FHIR data types.
fhir-dbt-analytics generic dbt tests
Thefhir-dbt-analytics package is maintained by Google and contains tests to check the quality of clinical data. The metrics might count the number of FHIR resources to compare to expected counts or check references between FHIR resources.
dbt Testing during development
While the tests above run against production data and are run even when none of the dbt code has changed, there are some tests that should be applied during development. This will improve a project's long term maintainability, assure project governance, and validate transformation logic in isolation of production data.
dbt-checkpoint is a library that can be leveraged during the development and release life-cycle to assure a level of governance of the dbt project. Typical validations include assuring that dbt models and/or their columns have descriptions and that all the columns in a dbt model (sql) are present in a property file (yml).
This dbt-meta-testing package contains macros to assert test and documentation coverage leveraging a configuration defined in the dbt_project.yml configuration settings.
While dbt tests are great to test with "real" data, sometimes you may want to test the logic of a transformation with "fake" data. This type of test is called a unit test. The dbt-unit-testing package has all you need to do proper dbt unit testing. (side note, the dbt Core team has announced the unit testing will be part of a future release of dbt although it may not be exactly as done using this package).
dbt_datamocktool can be used to create mock CSV seeds to stand in for the sources and refs that your models use and test that the model produces the expected output as compared with another CSV seed.
The dbt-unittest is a dbt package to enhance dbt package development by providing unit testing macros.
Reporting results of dbt tests
By default, dbt will not store the results of a dbt test execution. There is a configuration that can be set for the dbt project or at the specific model level which will have dbt store the failures of the test in a table in the data warehouse. While this is a good start, these test results get overridden each time dbt tests are run. To overcome this deficiency, tools have been developed in the community that store results longitudinally and even provide dashboards of test results.
Elementary is an open source data observability tool for dbt. It simplifies the capture of dbt test results over time, enables testing without having to manually add tests to all your dbt model columns, and has a user interface for viewing test results as well as dbt lineage. Elementary can also send Slack notifications.
dbt Data Quality package
This dbt Data Quality package is a Snowflake only package that helps users access and report on the outputs from dbt source freshness and dbt test results.
The dbt-tools package makes it simple to store and visualize dbt test results in a BI dashboard.
re_data is an open-source data reliability framework for modern data stack.
Getting started with dbt testing is simple thanks to the predefined generic dbt tests found within dbt Core and the additional generic tests found in dbt-utils and dbt-expectations. In addition to these juggernauts of the dbt community other organizations in the dbt community have contributed a additional generic tests, tools to improve dbt development, libraries that can help with validation and governance before releasing code to production and tools that can improve data quality observability.
Looking for an enterprise data platform?
Datacoves offers managed dbt core and Airflow and can be deployed in your private cloud.