
Using dbt: Documenting & Testing Data from Various Tools

In our previous article (An overview of testing options in dbt), we talked about the importance of testing data and how dbt, a tool developed by dbt Labs, helps data practitioners validate the integrity of their data. In that article we covered the various packages in the dbt ecosystem that can be used to run a variety of tests on data. Many teams have legacy ETL processes and cannot make the move to dbt quickly, but they can still leverage the power of dbt and, by doing so, slowly begin the transition to this tool. In this article, I’ll discuss how you can use dbt to test and document your data even if you are not using dbt for transformation.

Why dbt?

Ideally, we can prevent erroneous data from ever reaching our decision makers, and this is what dbt was created to do. dbt allows us to embed software engineering best practices into data transformation. It is the “T” in ELT (Extract, Load, and Transform), and it also helps capture documentation, testing, and lineage. Since dbt uses SQL as the transformation language, we can also add governance and collaboration via DataOps, but that’s a topic for another post.

I often talk to people who find dbt very appealing but have a lot invested in existing tools like Talend, Informatica, SSIS, or Python. They often have gaps in their processes around documentation and data quality, and while other tools exist to fill those gaps, I believe dbt is a good alternative. By leveraging dbt to fill the gaps in your current data processes, you also open the door to incrementally moving your transformations to dbt.

Eventually dbt can be fully leveraged as part of the modern data workflow to produce value from data in an agile way. The automated and flexible nature of dbt allows data experts to focus more on exploring data to find insights.

Why ELT?

The term ELT can be confusing; some people hear ELT and ETL and think they are fundamentally the same thing. This is muddied by marketers who try to appeal to potential customers by suggesting their tool can do it all. The way I define ELT is by making sure that data is loaded from the source without any filters or transformation. This is EL (Extract and Load): we keep all rows and all columns, and data is replicated even if there is no current need for it. While this may seem wasteful at first, it allows Analytics and Data Engineers to quickly react to business needs. Have you ever needed to answer a question only to find that the field you need was never imported into the data warehouse? This is common, especially in traditional thinking where it was costly to store data, or when companies had limited resources because their data warehouses coupled compute with storage. Today, warehouses like Snowflake have removed this constraint, so we can load all the data and keep it synchronized with the sources.

Another aspect of modern EL solutions is making the process of loading and synchronizing data simple. Tools like Fivetran and Airbyte allow users to easily load data by selecting pre-built connectors for a variety of sources and choosing the destination where the data should land. Gone are the days of creating tables in target data warehouses and dealing with changes when sources add or remove columns. The new way of working lets users set it and forget it.

In a modern data flow, data loaders such as Stitch, Fivetran, and Airbyte handle the extract and load process to get the data to the RAW area of the data warehouse. Once the data is in the warehouse, dbt can be leveraged for the transformation. dbt delivers transformed data and also enables snapshotting, testing, documenting, and deploying.

Plugging in dbt for testing

In an environment where other transformation tools are used, you can still leverage dbt to address gaps in testing. There are over 70 pre-built tests that can be leveraged, and custom tests can be created using nothing more than SQL. dbt can test data anywhere in the transformation lifecycle. It can be used at the beginning of the workflow to test or verify assumptions about data sources, and the best part is that these data sources or models do not need to be part of any ongoing dbt project. Imagine you have a raw customer table you are loading into Snowflake. We can connect this table to dbt by creating a source yml file where we tell dbt where to find the table by providing the name of the database, schema, and table. We can then list the table’s columns and, while we are at it, add descriptions.
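A minimal source definition might look like the following sketch. The source name, descriptions, and column list are illustrative; adjust them to match your own warehouse.

version: 2

sources:
  - name: snowflake_sample_data       # illustrative name used to reference this source
    database: SNOWFLAKE_SAMPLE_DATA
    schema: TPCH_SF100
    tables:
      - name: CUSTOMER
        description: Raw customer data from the TPC-H sample dataset
        columns:
          - name: C_CUSTKEY
            description: Unique identifier for a customer
          - name: C_NAME
            description: Customer name
          - name: C_MKTSEGMENT
            description: Market segment the customer belongs to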

The examples below illustrate how tests can be added for the CUSTOMER table in the TPCH_SF100 schema of the SNOWFLAKE_SAMPLE_DATA database.


We can add tests at the table level. Here we check that the table has between 1 and 10 columns.
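One way to express that check is with the dbt_expectations package (installed via packages.yml); the exact test you choose may differ, but the shape of the yml is the same:

      - name: CUSTOMER
        tests:
          # table-level test: fail if the column count falls outside 1-10
          - dbt_expectations.expect_table_column_count_to_be_between:
              min_value: 1
              max_value: 10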

We can also add tests at the column level. Here we assure that the C_CUSTKEY column has no duplicates by leveraging dbt’s unique test, and we check that the column is always populated with the not_null test.
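Both are built-in dbt tests and attach directly to the column in the same source yml:

      - name: CUSTOMER
        columns:
          - name: C_CUSTKEY
            tests:
              - unique      # no duplicate customer keys
              - not_null    # every row must have a customer key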

Testing non-source tables

So far we have done what you would learn in a standard dbt tutorial: you start with a source, connect it to dbt, and add some tests. But the reality is that dbt doesn’t really care whether the table we are pointing to is a true “source” table or not. To dbt, any table can be a source, even an aggregation, reporting table, or view. The process is the same: you create a yml file, specify the “source”, and add tests.

Let’s say we have a table that aggregates the number of customers by market segment. We can add a source that points to this table and check both for the expected market segments and for a reasonable range of customer counts per segment.

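Here is one way to sketch those checks. The database, schema, table, and column names and the bounds are illustrative, and the segment list assumes the TPC-H market segments. accepted_values is built into dbt and guards which segments may appear, while dbt_utils.accepted_range (from the dbt_utils package) bounds the customer counts:

sources:
  - name: analytics                  # illustrative source for tables built by the legacy ETL tool
    database: ANALYTICS
    schema: REPORTING
    tables:
      - name: CUSTOMERS_BY_SEGMENT
        columns:
          - name: C_MKTSEGMENT
            tests:
              # only the known market segments may appear in the aggregate
              - accepted_values:
                  values: ['AUTOMOBILE', 'BUILDING', 'FURNITURE', 'HOUSEHOLD', 'MACHINERY']
          - name: CUSTOMER_COUNT
            tests:
              # each segment should have a plausible number of customers
              - dbt_utils.accepted_range:
                  min_value: 1
                  max_value: 5000000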

Using this approach, we can leverage the tests available in dbt anywhere in the data transformation pipeline. We can use dbt_utils.equal_rowcount to validate that two relations have the same number of rows and ensure that a transformation step does not inadvertently drop rows.
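For example, if a legacy ETL job builds a staging copy of the raw customer table, both tables can be registered as sources and compared. The source and table names below are illustrative and assume the CUSTOMER source defined earlier:

      - name: STG_CUSTOMER
        tests:
          # the staging table should contain exactly as many rows as the raw table
          - dbt_utils.equal_rowcount:
              compare_model: source('snowflake_sample_data', 'CUSTOMER')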

When we are aggregating, we can also check that the resulting table has fewer rows than the table it is aggregated from by using the dbt_utils.fewer_rows_than test.

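Sketched against the illustrative aggregate source from above, the test sits on the aggregate table and compares it back to the raw table:

      - name: CUSTOMERS_BY_SEGMENT
        tests:
          # an aggregate can never have more rows than the table it summarizes
          - dbt_utils.fewer_rows_than:
              compare_model: source('snowflake_sample_data', 'CUSTOMER')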

Notice that you can use the source() macro when referring to another table that lives outside of dbt. As long as you register both tables as sources, you can refer to them. So when you see documentation that refers to the ref() macro, just substitute the source() macro as I did above.


Also, note that even though the documentation may describe this as a model test, you can use it in your sources: definition as I have done above.

Documenting tables

In dbt sources, we can also add documentation like so:

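For example, continuing with the illustrative aggregate table, descriptions can be added at both the table and the column level:

      - name: CUSTOMERS_BY_SEGMENT
        description: Number of customers per market segment, built by an external ETL job
        columns:
          - name: C_MKTSEGMENT
            description: Market segment name from the raw customer table
          - name: CUSTOMER_COUNT
            description: Count of distinct customers in the segment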

These descriptions will then show up in the dbt docs site, which you can build with dbt docs generate and browse locally with dbt docs serve.

By only having sources in dbt, you will not get dbt’s full lineage capability in the docs, but the above is more than many teams have.

Conclusion

dbt is a great tool for transforming data and capturing documentation and lineage, but if your company has a lot of transformation scripts built with legacy tools, the migration to dbt may seem daunting and you may think you cannot leverage the benefits of dbt.

By leveraging source definitions, you can take advantage of dbt’s ecosystem of tests and its documentation capabilities even if your transformations are done using other tools.

Over time, the organization will realize the power of dbt and you can gradually migrate your transformations to it. For data to be trusted, it needs to be documented and tested, and dbt can help you on this journey.

