X

Join the waitlist

Success!

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Transform Data

Getting Started with dbt (data build tool), demystifying terminology

When you are starting to use a new tool, one of the hardest things is learning all the new terminology. In this post we introduce some core dbt terminology and how it all connects.

Getting Started with dbt (data build tool), demystifying terminology

When you are learning to use a new tool or technology, one of the hardest things is learning all the new terminology. As we pick up language throughout our lives, we develop an association between words and our mental model of what they represent. The next time we see the word again that picture pops up in our head and if the word is now being used to mean something new, we must create a new mental model.

Language understanding is interesting in that once we have a mental model of a term, we have a hard time grasping the new association. I still remember the first time I spoke to someone about the Snowflake Data Warehouse, and they used the term warehouse. To me, the term had two mental models. One was a place where we store a lot of physical goods, type Costco Warehouse into Google and the first result is Costco Wholesale, a large retailer in the US that is so big it is literally a warehouse full of goods.

I have also worked in manufacturing, so I also associated a warehouse as the place where raw materials and finished goods are stored.

A traditional warehouse of physical goods

In programming, we would say we are overloading the term warehouse to mean different things.

In some programming languages, function overloading or method overloading is the ability to create multiple functions of the same name with different implementations – Wikipedia

 We do this type of thing all the time and don’t think twice about it. However, if I say “I need a bass” do you know what I am talking about?

Bass Guitar

In my Snowflake example, I knew the context was technology and more specifically something to do with databases, so I already had a mental model for a warehouse. It’s even in Wikipedia's description of the company.

Snowflake Inc. is a cloud computing-based data warehousing company - Wikipedia

I knew of data warehouses from Teradata and Amazon (Redshift), so it was natural for me to think of a warehouse as a technology and a place where lots of data is stored. In my mind, I quickly thought of

  • The Redshift Warehouse
  • The Teradata Warehouse
  • The Snowflake Warehouse

For those new to the term warehouse, I may have lost you already.  Maybe you are new to dbt and you come from the world of tools like Microsoft Excel, Alterix, Tableau, and PowerBI. If you know all this, grant me a few minutes to bring everyone up to speed.

Let’s step back and first define a database.

A database is an organized collection of structured information, or data, typically stored electronically in a computer system - Oracle
Nick Carter on Twitter: "@SpeedwaySam the first? Wha wha whaaat?? Let's not  make it the last. Thank you for being a loyal fan. http://t.co/lSfJG7MB7P"  / Twitter
What What What?

Ok, you probably know Excel. You have probably also seen an Excel Workbook with many sheets. If you organize your data neatly in Excel like the image below, we could consider that workbook a database.

Excel Sheet, a type of database

Going back to the definition above “organized collection of structured information” you can see that we have structured information, a list of orders with a Date, Order Quantity, and Order Amount. We also have a collection of these, namely Orders and Invoices.

In database terms, we call each Excel sheet a table and each of the columns an attribute.

Now back to a warehouse. This was my mental model of a warehouse.

A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise - Wikipedia

Again, if you are new to all this jargon, the above definition might not make much sense to you. Going back to our Excel Example. In an organization, you have many people with their own “databases” like the example above. Jane has one, Mario has another, Elena has a third. All have some valuable information we want to combine in order to make better decisions. So instead of keeping these Excel workbooks separately, we put them all together into a database and now we call that a warehouse. We use this central repository for our “business intelligence”

So, knowing all of this, when I heard of a Snowflake warehouse the above is what I thought. It is the place where we have all the data, duh. Just like Redshift and Teradata.  But look at what the people at Snowflake did, they changed the meaning on me.

A virtual warehouse, often referred to simply as a “warehouse”, is a cluster of compute resources in Snowflake. - Snowflake

The term warehouse here is no longer about the storage of things it now means “cluster of compute” A what of what?

Ok, let’s break this down. You are probably reading this on a laptop or some other mobile device. That device stores all your documents and when you perform some actions it “computes” certain things.  Well, in Snowflake the storage of the information is separate and independent of the computation on the things that are stored.  So, you can store things once and connect different “computers” to it. Imagine you were performing a task on your laptop, and it was slow. What if you could reach in your desk drawer, pull out a faster computer, and speed up the task that was slow, well, in Snowflake you can. Also, instead of just having one computer doing the work, they have a cluster of computers working together to get the job done even faster.

As you can see, language is tricky, and creating a shared understanding of it is crucial to advancing your understanding and mastery of the technology. Every Snowflake user develops the new mental model for a warehouse and using it is second nature, but we forget that these terms that are now natural to us may still be confusing to newcomers.

Understanding dbt (data build tool) terminology

Let’s start with dbt. When you join the dbt Slack community you will inevitably learn that the preferred way to write dbt is all lower case. Not DBT, not Dbt, just dbt.  I still don’t know why exactly, but you may have noticed that everyone in this space always puts “dbt (Data Build Tool)”

If you have some knowledge of Behavioral Therapy you may already know that DBT has a different meaning. Dialectical behavior therapy (DBT)

Dialectical behavioral therapy (DBT) is a type of cognitive-behavioral therapy. Cognitive-behavioral therapy tries to identify and change negative thinking patterns and pushes for positive behavioral changes.

Did you notice how they do the inverse? They spell out Dialectical behavior therapy and put DBT in parenthesis. So, maybe the folks at Fishtown Analytics, now dbt Labs came across this other meaning for DBT and chose to differentiate by using lowercase, or maybe it was to mess with all of the newbies lol. 

So update your auto-correct and don’t let dbt become DBT or Dbt or you will hear from someone in the community, haha.

Now let’s do a quick rundown of terms you will hear in dbt land which may confuse you as you start your dbt journey. I will link to the documentation with more information. My job here is to hopefully create a good mental model for you, not to teach you all the ins and outs of all of these things.

Seed or dbt seed

This is simply some data that you put into a file and make it part of your project. You put it in the seeds folder within your dbt project, but don’t use this as your source to populate your data warehouse, these are typically small files you may use as lookup tables. If you are using an older version of dbt, the folder would be data instead of seeds. That was another source of confusion, so now the term seed and the directory seed are more tightly connected. The format of these files must be CSV, more information can be found via the link above.  

Jinja

Jinja is a templating engine with syntax similar to the Python programming language that allows you to use special placeholders in your SQL code to make it dynamic. The stuff you see with {{ }} is Jinja.

Without Jinja, there is no dbt. I mean it is the combination of Jinja with SQL that gives us the power to do things that would otherwise be very difficult. So, when you see the lineage you get in the dbt documentation, you can thank Jinja for that.

Lineage graph generated by dbt leveraging the source and ref macros

dbt macro

I knew you would have this question. Well, a macro is simply a reusable piece of code. This too adds to the power of dbt. Every newcomer to dbt will quickly learn about the ref and source macros. These are the cornerstone of dbt. They help capture the relationship and sequence of all your data transformations. Sometimes you are using macros and you may not even realize it. Like the not_null test in your yml file, that’s a macro.

Not Null test in a yml file
Not Null test macro

Behind the scenes, dbt is taking information in your yml file and sending parameters to this macro. In my example, the parameter model gets replaced with base_cases (along with the database name and schema name) and colum_name gets replaced with cases. The compiled version of this test looks like this:

Compiled dbt not null test

There are dbt packages like dbt-expectations that extend the core dbt tests by adding a bunch of test macros, so check it out.

dbt package

What do you do when you have a lot of great macros that you want to share with others in the community? You create a dbt package of course.

But what is a dbt package? A package is simply a mini dbt project that can be incorporated into your dbt project via the packages.yml file.  There are a ton of great packages and the first one you will likely run into is dbt-utils. These are handy utilities that will make your life easier. Trust me, go see all the great things in the dbt-utils package.

Packages don’t just have macros though. Remember, they are mini dbt projects, so some packages incorporate some data transformations to help you do your analytics faster. If you and I both need to analyze the performance of our Google Ads, why should we both have to start from scratch?  Well, the fine folks over at Fivetran thought the same thing and created a Google Ads package to help.

When you run the command dbt deps, dbt will look at your packages.yml file and download the specified packages to the dbt_packages directory of your dbt project. If you are on an older version of dbt, packages will be downloaded to the dbt_modules directory instead, but again you can see how this could be confusing hence the updated directory name.

There are many packages and new ones arrive regularly. You can see a full listing on dbt hub.

dbt hub

This is the website maintained by dbt Labs with a listing of dbt packages.  

As a side note, we at Datacoves also maintain a similar listing of Python libraries that enhance the dbt experience in our dbt Libraries page. Check out all the libraries that exist. From additional database adapters to tools that can extract data from your BI tool and connect it with dbt, there’s a wealth of great open-source projects that take dbt to another level. Keep in mind that you cannot install Python libraries on dbt Cloud.

dbt models

These are the SQL files you find in the models directory. These files specify how you want to transform your data. By default, each of these files creates a view in the database, but you can change the materialization of a model to something else and for example, have dbt create a table instead.

Materialization

 Materializations define what dbt will do when it runs your models.  Basically, when you execute dbt run this is what happens.

  1. dbt reads all your files
  2. dbt then compiles the models by replacing the jinja code with the “real” code the database will run e.g. {{ ref(“my_model”) }} becomes my_database.my_schema.my_model
  3. Finally, it wraps the compiled code in the specified materialization, which by default is a view
Original dbt model you create
Compiled model dbt produces. Notice how line 3 was changed to a specific database object
The code that will actually run in the database is the compiled model code wrapped in the materialization, in this case, a create or replace view statement.

All the code that dbt compiles and runs can be found in the dbt target directory

Target

This term can be ambiguous to a new dbt user. This is because in dbt we use it interchangeably to mean two different things. As I used it above, I meant the directory within your dbt project where dbt commands write their output. If you look in this directory, you will see the compiled and run directories where I found the code I showed above.

dbt target directory

Now that you know what dbt is doing under the hood, you can look in this directory to see what will be executed in the database. When you need to do some debugging, you should be able to take code directly from the compiled directory and run it on your database.

dbt target

This is the other meaning for target. It refers to where dbt will create/materialize the objects in your database.

Again, dbt first compiles your model code and creates the files in the compiled directory. It then wraps the compiled code with the specified materialization and saves the resulting code in the run directory. Finally, it executes that code in your database target. It is the final file in the run directory that is executed in your database.

Code in the run directory is sent to your database

The image above is the code that runs in my Snowflake instance.

But how does dbt know which database target to use? You told it when you set up your dbt profile which is normally stored in a folder called .dbt in your computer's home folder (dbtCloud abstracts this complexity for you).

dbt Profile

When you start using dbt, you learn of a file called profiles.yml This file has your connection information to the database and should be kept secret as it typically contains your username and password.

This file is called profiles, plural, because you can have more than one profile which you eventually realize is where the target database is defined.  Here is a case where you can argue that a better name for this file is targets.yml, but you will learn later why the name profiles.yml was probably chosen and why this name makes sense.

Two targets defined in profiles.yml (database connection details collapsed for brevity)

Notice above that I have two different dbt targets defined below the word outputs, dev and prd.  dbt can only work on one target at a time so if you want to run dbt against two different databases you can specify them here. Just copy the dev target, give it a new name, and change some of the parameters.

Think of the word outputs on line 3 above as targets.  Notice in line 2 the line target: dev this tells dbt which target it should use as your default. In my case, unless I specify otherwise, dbt will use the dev target as my default connection. Hence it will replace the Jinja ref macro with my development database.

Line 3 shows what the ref macro gets replaced with using the default target in the profiles.yml file when compiling this model

How would you use the other target? You simply pass the target parameter to the dbt command like

dbt run --target prd or dbt run -t prd

What is that default: thing on the first line of my profiles.yml file?

My profiles.yml starts with the word default

Well you see, that’s the name given to your dbt profile, which by default is well, default.

dbt project

The dbt project is what is created when you create a project via the dbt init command. It includes all of the folders you typically associate with a dbt project and includes a configuration file called dbt_project.yml. If you look at your dbt_project.yml file, you will find something similar to this.

Line 10 shows which profile dbt will use from within your profiles.yml file

In line 10 you can see which profile dbt will look for in your profiles.yml file. If I change that line and try to run dbt, I will get an error.

New profile name that does not match what is in my profiles.yml file
dbt run fails because it didn't find the company_a profile in my profiles.yml file

NOTE: For those paying close attention, you may have seen I used-s and not -m when selecting a specific model to run.  This is the new/preferred way to select what dbt will run.

So now you see why profiles.yml is called profiles.yml and not targets.yml, because you can have multiple profiles in the file. In practice, I think people normally only have one profile, but nothing is preventing you from creating more and it might be handy if you have multiple dbt projects each with different connection information.

Those smart folks at Fishtown Analytics build in this flexibility for a very specific use case. You see, they were originally an analytics consulting company and developed dbt to help them do their work more efficiently. You can imagine that they were working with multiple clients whose project timelines overlapped so by having multiple profiles they could point each independent dbt project to a different profile in the profiles.yml file with each client's database connection information. Something like this.

profiles.yml with three profiles; default, company_a, and company_b

Now that I have a profile called company_a in my profiles.yml that matches what I defined in my dbt_project.yml dbt will run correctly.

dbt_project.yml pointing to a profile called company_a
dbt run can now find a profile named company_a so it knows what database connection to use

Conclusion

There is a ton of stuff to learn in your dbt journey and starting out with a solid foundation can help you better communicate and quickly progress through the learning curve.

Fishtown Analytics, now dbt Labs, created dbt to meet a real need they had and some of their shared vocabularies made it into the names we now use in the community. Those of us who have made it past the initial learning curve sometimes forget how daunting all the terminology can be for a newcomer.

There is a wealth of information you can find in the dbt documentation but it takes some time to get used to all the new terms and understand how it's all connected. So next time you come across a newbie, think about the term that you are about to use and the mental model they will have when you tell them to update the seed. We need to take our new dbt seeds (people) and mature them into strong trees.