dbt with Snowflake is one of the most popular and powerful data stacks today. Are you facing one of these situations?
- You just joined a data team using dbt Core with Snowflake, and you want to set up your local dbt development environment
- Your company is already using Snowflake and wants to try out dbt.
- You are using dbt with another data warehouse and you want to try dbt with Snowflake before migrating.
If you are facing any of these, this guide will help you set up your dbt environment for Snowflake using VS Code as your dbt development environment. We will also cover useful Python libraries and VS Code extensions that will make you more productive. If you want to become a dbt on Snowflake ninja, keep reading.
To get started, you need to set up your dbt development environment. This includes Python, dbt, and VS Code. You will also need to set up your Snowflake account properly so dbt can work its magic.
While dbt supports Python versions above 3.7, some other tools like Snowpark require Python 3.8, so we recommend you stick with that version. You can find the installer for your particular operating system on this page. Finding the right Windows installer can be confusing, so look for the link titled Windows installer (64-bit). If you are on a Mac, use the macOS 64-bit universal2 installer.
When using dbt, you will also need Git. To set up Git, follow this handy guide.
The preferred IDE for dbt is VS Code, which you can find on the official Microsoft site. Click the big Download button and install it like any other app.
Installing dbt is done using pip. You can find more information on this page.
For dbt Snowflake, simply run:
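```bash
pip install dbt-snowflake
```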
This will install the latest version of dbt core along with the dbt adapter for Snowflake. If you need to use an older version of dbt, you will need to specify it when you run pip. However, the version of the dbt adapter may not match the version of dbt core.
For example, as of this writing, the last version of dbt core 1.3.x is 1.3.4. However, dbt-snowflake for 1.3.x is 1.3.2. So, if you want to install the latest dbt 1.3.x for snowflake, you would run
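```bash
# ~= keeps you on the 1.3.x series and picks up the latest patch release of each package
pip install "dbt-core~=1.3.0" "dbt-snowflake~=1.3.0"
```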
This will install dbt-snowflake 1.3.2 along with dbt-core 1.3.4.
The final prerequisite is to set up a Snowflake user that has been granted a role with the right access to a database where dbt will create views and tables. The role should also have access to a Snowflake warehouse for compute. Here is a handy guide that gives you the basics. We would recommend a more comprehensive setup for a production deployment, but this will get you going for now.
The key items not covered in that guide are that you should create an ANALYST role and an ANALYTICS_DEV database, then grant OWNERSHIP of that database to the ANALYST role. The ANALYST role should also be granted USAGE on the TRANSFORMING warehouse. Finally, grant the ANALYST role to your specific user. Don’t run as ACCOUNTADMIN.
All of this is needed because when dbt runs, it will create a schema for your user in the ANALYTICS_DEV database, and it will use the TRANSFORMING warehouse whenever compute is needed, such as when dbt creates tables.
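A minimal sketch of those grants, assuming the object names above; run it with a sufficiently privileged role and replace MY_USER with your Snowflake username:

```sql
create role if not exists ANALYST;
create database if not exists ANALYTICS_DEV;
create warehouse if not exists TRANSFORMING with warehouse_size = 'XSMALL' auto_suspend = 60;

grant ownership on database ANALYTICS_DEV to role ANALYST;
grant usage on warehouse TRANSFORMING to role ANALYST;
grant role ANALYST to user MY_USER;
```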
If all of this seems confusing or tedious, you should consider a managed dbt solution like dbt Cloud or Datacoves. For more information, check out our in-depth article where we compare dbt Cloud vs dbt Core as well as managed dbt Core.
Now with all the pre-requisites out of the way, let’s configure dbt to connect with Snowflake.
To initialize your dbt project for Snowflake, dbt has a handy command: dbt init. You can configure your dbt Snowflake profile using dbt init for both a new and an existing dbt project. First, clone your repo using git. Then simply run dbt init and go through the prompts.
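In practice that looks roughly like this; the repository URL and project name are placeholders:

```bash
git clone https://github.com/your-org/your-dbt-project.git
cd your-dbt-project
dbt init
```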
Once you get your project set up, consider adding a profile_template.yml to your project. As stated on that page, using a profiles template will simplify the dbt init process for users on your team.
To make sure dbt can connect to Snowflake, run dbt debug. If dbt can connect to your Snowflake account, you should see “All checks passed!” If you have problems, join the dbt Community, search the forums, or ask a question in the #db-snowflake channel.
Even though dbt init set up your profiles.yml to connect to Snowflake with your credentials, it only provides the basic configuration. This page lists additional parameters that you can configure for the Snowflake connection in your profiles.yml file.
If you want to configure those parameters, you will need to open and edit the profiles.yml file. The profiles.yml file created by dbt init will be in your user’s home directory in a subdirectory called .dbt.
One handy configuration parameter to change is reuse_connections. Also, if you use SSO authentication with an external browser, you should consider setting up connection caching on Snowflake; otherwise, you will be prompted to authenticate for every connection dbt opens to the database.
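For reference, a Snowflake profile in profiles.yml ends up looking roughly like this; the account, user, and schema values are placeholders, and the authentication method should match whatever your team uses:

```yaml
default:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xy12345.us-east-1      # placeholder account identifier
      user: my_user
      authenticator: externalbrowser  # or password / key pair authentication
      role: ANALYST
      database: ANALYTICS_DEV
      warehouse: TRANSFORMING
      schema: my_schema
      threads: 8
      reuse_connections: true
```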
Now that you have set up your dbt connection to Snowflake, there are some other options you can configure when dbt runs against your Snowflake account. This includes overriding the default warehouse for a specific model, adding query tags, copying grants, etc. This handy page has a lot more information on these dbt snowflake advanced configurations.
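For example, a model can override the warehouse, set a query tag, and copy grants in its config block; the warehouse name, query tag, and the stg_orders model below are illustrative:

```sql
{{ config(
    materialized = 'table',
    snowflake_warehouse = 'TRANSFORMING_XL',
    query_tag = 'dbt_large_models',
    copy_grants = true
) }}

select *
from {{ ref('stg_orders') }}
```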
Now that you have dbt connecting to your database, let’s talk about some python libraries you should set up to improve how you work with dbt.
dbt-coves is an open source library created by Datacoves to complement dbt development by simplifying tedious tasks like generating staging models. It is a must-have tool for any dbt practitioner who wants to improve their efficiency. dbt-coves will automatically create your source.yml and staging models as well as their corresponding yml (property) files. It also has utilities for backing up Airbyte and Fivetran configurations.
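For example, generating source definitions and staging models boils down to a single command (check the dbt-coves documentation for the flags available in your version):

```bash
dbt-coves generate sources
```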
SQLFluff is a Python library for linting SQL code. SQLFluff integrates seamlessly with dbt using a templater, and it is the only SQL linter with native dbt support. If you have not heard of code linting, it helps you enforce rules on how your SQL code is formatted; for example, whether everyone should use leading or trailing commas, or whether SQL keywords should be upper or lower case. We recommend everyone use a linter, as this will improve code readability and long-term maintainability.
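A minimal .sqlfluff configuration for a dbt-on-Snowflake project looks something like this; the dbt templater also requires installing the sqlfluff-templater-dbt package:

```ini
[sqlfluff]
dialect = snowflake
templater = dbt

[sqlfluff:templater:dbt]
project_dir = .
```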
pre-commit with dbt-checkpoint
dbt-checkpoint is a tool that allows you to make sure your dbt project complies with certain governance rules. For example, you can have a rule that validates whether every dbt model has a description, or ensure that every column is documented, among many other rules. We also recommend using dbt-checkpoint, as it ensures developers don’t add technical debt from the start of a project.
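dbt-checkpoint plugs into pre-commit; a sketch of a .pre-commit-config.yaml might look like this, with the rev shown being a placeholder you would pin to a real release:

```yaml
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v1.0.0  # placeholder; pin the release you adopt
    hooks:
      - id: check-model-has-description
      - id: check-model-has-properties-file
```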
In addition to these Python libraries, at Datacoves we set up the development environment with other handy libraries like Snowpark and Streamlit. We believe that flexibility is important especially in enterprise environments. If you want to learn what to consider when selecting a managed dbt core platform, check out our guide.
In addition to Python libraries, you can improve your dbt workflow with Snowflake by installing these VS Code extensions.
The official Snowflake VS Code extension keeps you in the flow by bringing the Snowflake interface to VS Code. With it you can explore your database, run queries, and even upload and download files to Snowflake stages. It is a must-have for any Snowflake user.
dbt Power User is a VS Code extension that improves the dbt development experience by adding handy shortcuts to run dbt models and tests, and it even lets you preview the result of a dbt model or a CTE within that model.
The SQLFluff VS Code extension is the companion to the SQLFluff python library. It improves the development experience by highlighting linting errors right in line with your SQL code. It even has a handy hover which describes the linting error and links to the SQLFluff documentation.
There are many other great VS Code extensions, and at Datacoves we are always improving the dbt developer’s experience by pre-installing them. One recent addition, demonstrated in the video below, is a ChatGPT extension that improves the dbt experience by helping write documentation, among other functions.
Getting started with dbt and Snowflake is not hard and knowing how to streamline the development experience when working with dbt and Snowflake will maximize developer happiness.
Some users may run into issues configuring their development environment. If that happens, check out the #sqlfluff, #tools-dbt-libraries, and #tools-vscode channels on the dbt Slack community. There are many helpful people there always ready to help.
As you can see there are a lot of steps and potential gotchas to get a dbt environment properly configured. This gets more complicated as the number of dbt users increases. Upgrading everyone can also pose a challenge. These reasons and more are why we created the most flexible managed dbt-core environment. If you want your developers to be up and running in minutes with no installation required, reach out and we can show you how we can streamline your teams’ dbt experience with best practices from the start.
Since its inception, dbt has served the needs of small to mid-sized companies to transform data into actionable insights. Now as larger enterprises look to adopt dbt, they must consider the unique complexities of large enterprises. These include integrating with existing systems behind a corporate firewall, orchestrating end-to-end data pipelines, and implementing DataOps processes to transform how teams collaborate across the organization.
This article compares three popular dbt development and deployment options: leveraging dbt Cloud, using dbt Core alone, or using a dbt Core managed platform other than dbt Cloud. To learn more about specific dbt Cloud features and how to implement them with dbt Core and other open-source tools, check out dbt Core vs dbt Cloud - Key Differences.
dbt Cloud is a hosted environment to develop and deploy dbt Core projects. dbt Cloud leverages dbt Core, the open-source data transformation framework we all know as dbt. dbt Cloud pricing starts at $100 per developer for teams; if you are a single developer, they offer a free single seat with limited features.
When companies are ramping up with dbt, one of the hardest parts is setting up and managing dbt IDE environments. Analytic Engineers coming to dbt may not be familiar with concepts like version control with git or using the command line. The dbt Cloud IDE simplifies developer onboarding by providing a web-based SQL IDE to team members so they can easily write, test, and refine data transformation code without having to install anything on their computers. Complexities like starting a git branch are tucked behind a big colorful button so users know that is the first step in their new feature journey.
Developers who are accustomed to more versatile local IDEs, such as VS Code, may find the dbt Cloud experience limiting as they cannot leverage extensions such as those from the VS Code Marketplace nor can they extend dbt Core using the vast array of Python libraries. An example of a common library employed in the dbt community is SQLFluff. With SQLFluff, teams can ensure a minimum level of code quality is achieved across developers.
DataOps is a set of practices that combines data management and software engineering to improve the quality, speed, and collaboration of data analytics teams. In an enterprise environment, implementing DataOps processes early on in a project's lifecycle is crucial.
dbt Cloud can be paired with Continuous Integration (CI) tools like GitHub Actions to validate data transformations before they are added to a production environment.
If CI integration checks include things like SQLFluff and dbt-checkpoint, there is currently no way for users to run these validations in dbt Cloud prior to executing the automated CI checks in GitHub Actions. This may lead to a lot of back and forth between the GitHub Actions run failures and the dbt Cloud IDE to fix and commit the changes. Aspects such as code reviews and approvals will occur in the CI/CD tool of choice such as GitHub and dbt Cloud can report job execution status back to GitHub Actions. This allows teams to know when it is safe to merge changes to their code.
Companies that have tools like Bitbucket, Jira, and Jenkins within their corporate firewall may find it challenging to integrate with dbt Cloud.
dbt Cloud makes deploying a dbt Core project simple. It allows you to define custom environment variables and the specific dbt commands (seed, run, test) that you want to run during production runs. The dbt Cloud scheduler can be configured to trigger at specific intervals using an intuitive UI.
dbt Cloud is primarily focused on running dbt projects. Therefore if a data pipeline has more dependencies, an external orchestration tool may be required. Cases where this is necessary include when you need to connect the Extract and Load step to the Transformation happening in dbt. Without connecting these steps, you have no assurance that the load is complete before the transformation starts. This can be mitigated with freshness checks, but without an explicit connection between the load step and the transformation, you may be delaying transformation unnecessarily.
Fortunately, if you do use an external orchestrator, dbt Cloud offers an API to trigger dbt Cloud jobs from your orchestrator. Another reason an external orchestrator may be needed in an enterprise is when there are additional steps to run after the dbt transformation. These may include sending data to downstream BI tools, marketing campaign tools, MDM systems, or refreshing machine learning models. dbt Cloud recently announced a webhooks API that would allow for triggering steps after transformation. However, if these endpoints reside within a corporate firewall, this may not be an option and an orchestrator like Airflow may be necessary.
The dbt Cloud IDE does not allow for the installation of Python libraries and dbt Cloud cannot create and orchestrate Python scripts that leverage tools like SQLFluff (this is coming to dbt Cloud later this year), dbt-coves, dbt-fal, Snowpark, Permifrost, or Streamlit. Additionally, the dbt Cloud IDE cannot be extended with extensions such as the official Snowflake or Databricks VS Code extensions.
While dbt Cloud can send notifications via email and Slack, it does not support other popular enterprise systems such as MS Teams.
dbt Cloud offers industry-standard SaaS security appropriate for any enterprise. At the enterprise level, it offers Single Sign-On and Role Based Access Controls. Enterprises can also choose to have dbt Cloud provisioned on a dbt Labs managed private cloud account. dbt Cloud is only offered as a managed SaaS solution on AWS and thus cannot be deployed to a company's existing private cloud account.
dbt Cloud is especially attractive for organizations with limited in-house technical resources or those seeking to minimize the time and effort spent on initial setup, maintenance, and upgrades. dbt Cloud runs in the browser and thus removes the need for manual installation and setup of dbt components, allowing your team to focus on data transformation tasks. dbt Cloud handles updates and upgrades for you, so it is simple to stay up-to-date and secure. dbt Cloud includes built-in scheduling capabilities, allowing you to automate dbt runs and testing without relying on external schedulers or orchestration tools.
dbt Cloud may introduce additional complexities in large enterprises with more complex orchestration needs or those that need to host the environment within their private cloud. Finally, the dbt Cloud IDE may feel constraining for users who prefer a VS Code environment.
All these features can in turn reduce the time to value in setting up an end-to-end enterprise data management platform. It is possible to start out with a simple setup, but inevitably not accounting for the breadth of the entire platform may create technical debt and cause rework in the future.
Pros:
- Simplified setup and management of dbt environment
- Enhanced collaboration capabilities for data teams
- Integrated dbt job scheduling and monitoring
- Managed upgrades and updates
- Web-based IDE for streamlined development
- Metadata API
Cons:
- Reliance on a third-party platform
- May be more expensive for larger teams or organizations with complex requirements
- Not able to deploy within a company's private cloud
- Unable to leverage VS Code extensions and other Python libraries
- May require additional components or additional SaaS solutions for complex orchestrations
dbt Core is the heart of dbt and since it is open source, anyone can use it freely. That being said, an enterprise data management platform must account for additional requirements such as the development experience, deployment and scheduling dbt runs, and creating a DataOps process. This will impact the total cost of ownership and the platform's time to market.
Leveraging dbt Core allows for ultimate flexibility, but it involves setting up a development environment on each team member's local machine or on a virtual environment such as an AWS workspace. Setup includes installing a popular dbt IDE like VS Code, installing dbt Core, configuring a connection to the data warehouse, and managing additional dependencies such as a compatible version of Python.
While the overall installation process for dbt is similar for both Windows and Mac users, there are some differences and potential gotchas to be aware of. Additionally, most organizations provide users with pre-configured computers and these systems may be delivered with software that conflicts with dbt Core. This complexity can lead to frustration and delays and may require assistance from senior members of the team, pulling them away from value-added work.
Once everything is installed, configuring dbt to connect to a data warehouse can also introduce complexity before a new team member can begin familiarizing themselves with the organization's dbt project structure and coding practices.
An enterprise setup of dbt should also include components that improve developer code quality and productivity. Popular IDEs like VS Code include a vast array of extensions that also improve the development experience. These include dbt Power User, VS Code extension for SQLFluff, and the dbt Snowflake extension.
Configuring a development environment can take anywhere from a few days to as long as a week, depending on the person’s technical aptitude and experience with dbt. To improve this process, companies spend weeks or months dialing in the installation process and documenting it in knowledge base tools like Confluence or Notion.
A development environment is not evergreen; it must be upgraded as new versions of dbt Core are released for added functionality or to fix bugs. Upgrading can be as error-prone as new installations, and some companies opt not to keep their enterprise data platform current. This, however, is ill-advised because this type of technical debt only gets harder to remediate over time.
When using dbt Core for your enterprise data platform, you will need to not only define and configure the automation scripts, but you will also need to ensure that all the components, such as a git server, CI server, CI runners, etc. are all working harmoniously together.
Since dbt Core runs within the corporate firewall, it can be integrated with any CI tool and internal components such as Jira, Bitbucket, and Jenkins. This flexibility comes at a price: to do this well, all the project dependencies must be packaged into reusable Docker containers. Notifications will also need to be defined across the various components, and all of this takes time and money.
When setting up a deployment environment, companies using dbt Core can leverage any orchestration tool, such as Airflow. They can connect steps prior to or after the dbt transformations and they can trigger any tool that exists within or outside the corporate network.
However, scaling tools like Airflow is not straightforward and may require knowledge of additional technologies like Kubernetes. By default, Airflow uses Python files and therefore, the team working with these orchestration scripts will need to also know Python on top of SQL.
A mature enterprise platform will also require multiple Airflow environments for development, testing, and production, adding complexity and increasing the total cost of ownership.
Using dbt Core is inevitably the most flexible option, which is why many organizations choose to go this route. Companies can run dbt Core and leverage leading code editors like VS Code. They can install any additional Python dependencies, including running any proprietary code that is only available within the corporate firewall. They can also take advantage of the ever-growing VS Code extensions such as those for Snowflake or Databricks.
Teams using dbt Core can send notifications to Slack, MS Teams, or any other platform as long as they develop and maintain these solutions.
However, this extensibility will only increase the platform's total cost of ownership and the time it takes to deploy the end-to-end platform.
One reason some enterprises choose to use dbt Core is that they are not able to leverage SaaS solutions due to compliance or data privacy requirements such as GDPR. By leveraging dbt Core, they can achieve ultimate security as dbt Core would run in and comply with corporate network policies.
Running dbt Core within the corporate firewall also eliminates the need for lengthy risk assessments. However, ensuring that security patches are applied is left to the organization. Managing Single Sign-on with tools like Airflow and setting up Role Based Access Controls are also things that will have to be solved when using dbt Core as a dbt enterprise data management platform.
A lot of the power and flexibility with using dbt Core comes from being able to customize the entire dbt experience. However, all of this comes with the expense of having to build a data management platform. It is not uncommon for organizations to take six months or up to a year to set up and test their custom dbt enterprise data management platform. They will also likely need to hire external consultants who have worked with a myriad of technologies and understand how to integrate them.
Building a platform is not for the faint of heart and since it is something custom to the organization, it will also increase the total cost of ownership. Some organizations get the basics in place just to make their teams productive, but eventually, they find that without proper design and implementation, they cannot take full advantage of the value of dbt and the rest of the modern data stack.
Using dbt Core as a dbt enterprise data platform can be a cost-effective option for smaller teams with limited requirements and budgets. It can also be good for those trying out dbt. Large organizations with stringent security requirements for sensitive data and strong technical platform teams may also leverage this option, as it allows them to deploy in a way that is compliant with corporate policies.
As platform complexity grows, so does the cost and time needed to implement the end-to-end platform. Larger organizations should consider whether developing something custom is better than using a managed solution or taking shortcuts that can reduce the value of dbt and the modern data stack.
Pros:
- Full control over the dbt environment and configurations
- No reliance on third-party services
- No subscription costs
- Able to meet corporate security requirements by installing within the corporate firewall
Cons:
- Requires technical expertise in defining the installation steps and documenting them
- May require hands-on involvement during setup and upgrades
- It can be time-consuming and challenging to maintain consistency across team members
- Upgrades and dependency management may be more complex compared to managed solutions
- Resolving installation issues can consume senior team members' time
- Total cost of ownership of the platform may be higher than using a managed solution
- Additional expertise will be needed to set up all the required technologies
- It will take longer to implement than a managed solution
We have seen that dbt Cloud is a great dbt Core managed platform, but it has some limitations for large or complex deployments. On the other hand, using dbt Core alone gives organizations ultimate flexibility, but it comes at an added expense in terms of the time and cost it takes to stand up and maintain a custom, one-off, enterprise data management platform.
What if you could combine the simplicity of dbt Cloud with the flexibility of dbt Core? A managed dbt Core platform, like Datacoves, can combine some of the best of both worlds. However, not all managed platforms are the same; below we discuss what to look for when considering a managed dbt Core enterprise data platform. Check out our dbt pricing page for more pricing information.
By far, the tool most of the dbt enterprise users will interact with is the IDE. When considering a managed dbt Core platform, consider the developer experience. Will they have VS Code in the browser? Will you get a curated, best in class dbt IDE like VS Code with integrated libraries like SQLFluff and extensions that simplify the developer workflow? When required, will you be able to customize the developer environment with company specific Python libraries and extensions?
In a best-in-class developer setup, new users are onboarded in minutes with simple configuration screens that remove the need to edit text files such as profiles.yml and remove the complexity of creating and managing SSH keys. Version upgrades of dbt or any dependent library should be transparent to users.
Some platforms will also have added flexibility and let you leverage VS Code for non-dbt development tasks, such as opening a Jupyter notebook environment or a Python development environment with libraries such as Python Black.
Spinning up a pristine environment should be a matter of clicks. To test new dbt versions, you should be able to run your project in an isolated environment before switching all developers to the new version of dbt.
The advantage of a managed dbt Core platform when it comes to DataOps is that you do not have to worry about creating Docker images with all the dependencies you need as you perform your Continuous integration tests. Working with a partner that has experience with enterprise CI/CD platforms like Jenkins will also accelerate your time to value. A knowledgeable dbt Core managed platform partner will know best practices and guide you in setting up a world-class workflow that will scale with your enterprise. Developers should also be able to execute CI validations locally before pushing their changes to GitHub, thus reducing the back and forth to fix issues during the CI checks.
Some managed dbt Core platforms can run within your corporate firewall, allowing you to leverage existing tools like Jira, Bitbucket, Jenkins, or Gitlab. All this simplifies and accelerates the rollout of a dbt data management platform.
Enterprises are complex, and as such, they must consider a flexible orchestration tool that has wide adoption and good documentation and can scale to the needs of the company.
A mature enterprise dbt Core platform will include a full-fledged orchestrator such as Airflow. With Airflow, organizations can orchestrate pipelines that include Extract, Load, Transform, and Activation steps. Other aspects to consider are whether Airflow jobs can scale by leveraging technologies such as Kubernetes and whether it will be simple to send notifications to enterprise tools such as MS Teams.
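For context, a bare-bones Python-based Airflow DAG that triggers dbt might look like the sketch below; the schedule and project path are placeholders, and production setups typically run dbt inside isolated containers or virtual environments:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_dbt_build",
    schedule_interval="0 6 * * *",  # placeholder schedule
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # Run the dbt project; the path is a placeholder for wherever the project is deployed
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/analytics/dbt_project && dbt build",
    )
```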
Some managed dbt Core platforms have also adopted mechanisms that simplify Airflow DAG creation by letting users define DAGs with yml files rather than writing traditional Python-based DAGs like the one sketched above.
A complete enterprise data management platform will allow you to easily create multiple Airflow environments for development, testing, and production. It will also eliminate the burden of having to create Docker images with your dbt dependencies.
An enterprise dbt Core platform will streamline the process of ramping up with dbt and offer the organization the flexibility it requires. This may include triggering AWS Glue jobs as well as Fivetran jobs. It may include triggering internal and external APIs, or it may require a pre-configured development environment with Python libraries like Snowpark or Streamlit.
When it comes to security, different industries have different requirements. Regulated industries in Life Sciences or Finance may not want to, or may be unable to, use SaaS solutions, so being able to deploy a dbt enterprise data management platform within the corporate firewall is a must.
Integrating with SSO solutions like Okta or PingID is also a must, and managing users' authorization using existing mechanisms like Active Directory simplifies compliance with corporate policies.
Another area that may be overlooked when using open-source solutions is validating that libraries are up to date and there are no vulnerabilities introduced into the network. A robust dbt Core managed platform will be able to leverage internal registries like Artifactory, which are set up to scan for these issues.
Finally, at enterprise scale, the dbt Core managed platform must have robust Role Based Access Controls to assure users have the appropriate level of access.
Just like dbt Cloud, a reliable enterprise dbt Core data management platform will reduce the time to value. Teams will not have to figure out how to host and configure the multitude of tools and libraries and they will have a starting point for implementing best practices at the start of the project.
By eliminating the need to stand up a platform, teams can focus on change management and the new ways of working that will have the biggest impact on the organization. The total cost of ownership will be lower for the company since they do not have to build and constantly maintain their one-off platform. Instead, they will benefit from working with a partner that works with other large organizations and looks for ways to improve and enhance the platform for everyone.
A managed dbt Core enterprise data platform is necessary for organizations that are not in the business of creating and managing complex technologies and those who need the flexibility that comes with complex enterprise tools and processes. When dbt Core must be deployed within the corporate firewall, an enterprise managed dbt data platform can also be a good option for companies with stringent security requirements to protect sensitive data.
Smaller organizations that want to leverage a managed VS Code environment or who want to implement DataOps from the outset can also realize the benefits of a SaaS managed dbt Core data platform.
A managed dbt Core enterprise data platform gives companies the simplicity of dbt Cloud with the peace of mind of dbt Core. Best of all, if the platform leverages open-source technologies, there is no vendor lock-in.
Pros:
- Simplified developer onboarding
- Robust IDE such as VS Code
- Ability to configure a dbt environment
- Managed orchestrator such as Airflow
- Able to meet corporate security requirements by installing within the corporate firewall
- Requires a lot less technical expertise to set up and run
- Simplified dependency management across the development environment, CI/CD worker, and Airflow
- Lower total cost of ownership
Cons:
- Unable to have full control over the platform
- Reliance on third-party
- Licensing costs
Selecting between dbt Cloud, dbt Core, or a managed dbt Core enterprise platform comes down to the experience within an organization and their desire to focus on data versus developing and maintaining a custom dbt platform. The beauty of the dbt ecosystem is that the components are freely available, and with enough time and money, anyone can build what they need. However, enterprises that want to move fast and leverage the full potential of dbt should consider a managed dbt Core enterprise platform such as Datacoves to reduce the total cost of ownership and accelerate time to market. It is important to be aware of the dbt alternatives to make an informed decision.
When you are learning to use a new tool or technology, one of the hardest things is learning all the new terminology. As we pick up language throughout our lives, we develop an association between words and our mental model of what they represent. The next time we see the word, that picture pops up in our head, and if the word is now being used to mean something new, we must create a new mental model. In this post, we introduce some core dbt (Data Build Tool) terminology and how it all connects.
Language understanding is interesting in that once we have a mental model of a term, we have a hard time grasping a new association. I still remember the first time I spoke to someone about the Snowflake Data Warehouse and they used the term warehouse. To me, the term had two mental models. One was a place where we store a lot of physical goods; type Costco Warehouse into Google and the first result is Costco Wholesale, a large retailer in the US that is so big it is literally a warehouse full of goods.
I have also worked in manufacturing, so I also associated a warehouse as the place where raw materials and finished goods are stored.
In programming, we would say we are overloading the term warehouse to mean different things.
In some programming languages, function overloading or method overloading is the ability to create multiple functions of the same name with different implementations – Wikipedia
We do this type of thing all the time and don’t think twice about it. However, if I say “I need a bass” do you know what I am talking about?
In my Snowflake example, I knew the context was technology and more specifically something to do with databases, so I already had a mental model for a warehouse. It’s even in Wikipedia's description of the company.
Snowflake Inc. is a cloud computing-based data warehousing company - Wikipedia
I knew of data warehouses from Teradata and Amazon (Redshift), so it was natural for me to think of a warehouse as a technology and a place where lots of data is stored. In my mind, I quickly pictured a big, central store of data.
For those new to the term warehouse, I may have lost you already. Maybe you are new to dbt and you come from the world of tools like Microsoft Excel, Alteryx, Tableau, and PowerBI. If you already know all this, grant me a few minutes to bring everyone else up to speed.
Let’s step back and first define a database.
A database is an organized collection of structured information, or data, typically stored electronically in a computer system - Oracle
Ok, you probably know Excel. You have probably also seen an Excel Workbook with many sheets. If you organize your data neatly in Excel like the image below, we could consider that workbook a database.
Going back to the definition above “organized collection of structured information” you can see that we have structured information, a list of orders with a Date, Order Quantity, and Order Amount. We also have a collection of these, namely Orders and Invoices.
In database terms, we call each Excel sheet a table and each of the columns an attribute.
Now back to a warehouse. This was my mental model of a warehouse.
A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise - Wikipedia
Again, if you are new to all this jargon, the above definition might not make much sense to you. Going back to our Excel example: in an organization, you have many people with their own “databases” like the one above. Jane has one, Mario has another, Elena has a third. All have some valuable information we want to combine in order to make better decisions. So instead of keeping these Excel workbooks separately, we put them all together into a database, and now we call that a warehouse. We use this central repository for our “business intelligence”.
So, knowing all of this, when I heard of a Snowflake warehouse the above is what I thought. It is the place where we have all the data, duh. Just like Redshift and Teradata. But look at what the people at Snowflake did, they changed the meaning on me.
A virtual warehouse, often referred to simply as a “warehouse”, is a cluster of compute resources in Snowflake. - Snowflake
The term warehouse here is no longer about the storage of things; it now means a “cluster of compute.” A what of what?
Ok, let’s break this down. You are probably reading this on a laptop or some other mobile device. That device stores all your documents and when you perform some actions it “computes” certain things. Well, in Snowflake the storage of the information is separate and independent of the computation on the things that are stored. So, you can store things once and connect different “computers” to it. Imagine you were performing a task on your laptop, and it was slow. What if you could reach in your desk drawer, pull out a faster computer, and speed up the task that was slow, well, in Snowflake you can. Also, instead of just having one computer doing the work, they have a cluster of computers working together to get the job done even faster.
As you can see, language is tricky, and creating a shared understanding of it is crucial to advancing your understanding and mastery of the technology. Every Snowflake user develops the new mental model for a warehouse and using it is second nature, but we forget that these terms that are now natural to us may still be confusing to newcomers.
Let’s start with dbt. When you join the dbt Slack community you will inevitably learn that the preferred way to write dbt is all lower case. Not DBT, not Dbt, just dbt. I still don’t know why exactly, but you may have noticed that everyone in this space always puts “dbt (Data Build Tool)”.
If you have some knowledge of behavioral therapy, you may already know that DBT has a different meaning: Dialectical Behavior Therapy (DBT).
Dialectical behavioral therapy (DBT) is a type of cognitive-behavioral therapy. Cognitive-behavioral therapy tries to identify and change negative thinking patterns and pushes for positive behavioral changes.
Did you notice how they do the inverse? They spell out Dialectical behavior therapy and put DBT in parenthesis. So, maybe the folks at Fishtown Analytics, now dbt Labs came across this other meaning for DBT and chose to differentiate by using lowercase, or maybe it was to mess with all of the newbies lol.
So update your auto-correct and don’t let dbt become DBT or Dbt or you will hear from someone in the community, haha.
Now let’s do a quick rundown of terms you will hear in dbt land which may confuse you as you start your dbt journey. I will link to the documentation with more information. My job here is to hopefully create a good mental model for you, not to teach you all the ins and outs of all of these things.
A seed is simply some data that you put into a file and make part of your project. You put it in the seeds folder within your dbt project, but don’t use seeds as the source that populates your data warehouse; these are typically small files you may use as lookup tables. If you are using an older version of dbt, the folder would be data instead of seeds. That was another source of confusion, so now the term seed and the seeds directory are more tightly connected. The format of these files must be CSV; more information can be found via the link above.
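For example, a small lookup file like the hypothetical country_codes.csv below, placed in the seeds folder, gets loaded into your warehouse as a table when you run dbt seed:

```csv
country_code,country_name
US,United States
CA,Canada
MX,Mexico
```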
Jinja is a templating engine with syntax similar to the Python programming language that allows you to use special placeholders in your SQL code to make it dynamic. The stuff you see with {{ }} is Jinja.
Without Jinja, there is no dbt. I mean it is the combination of Jinja with SQL that gives us the power to do things that would otherwise be very difficult. So, when you see the lineage you get in the dbt documentation, you can thank Jinja for that.
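Here is a small illustration of Jinja inside SQL; stg_orders is a hypothetical upstream model and start_date is a hypothetical variable:

```sql
select
    order_id,
    order_date,
    order_total
from {{ ref('stg_orders') }}
where order_date >= '{{ var("start_date", "2020-01-01") }}'
```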
I knew you would have this question. Well, a macro is simply a reusable piece of code. This too adds to the power of dbt. Every newcomer to dbt will quickly learn about the ref and source macros. These are the cornerstone of dbt. They help capture the relationship and sequence of all your data transformations. Sometimes you are using macros and you may not even realize it. Like the not_null test in your yml file, that’s a macro.
Behind the scenes, dbt takes the information in your yml file and sends parameters to this macro. In my example, the model parameter gets replaced with base_cases (along with the database name and schema name) and column_name gets replaced with cases. The compiled version of this test looks roughly like this:
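```sql
-- Approximate compiled output; the database and schema names come from your connection settings
select cases
from my_database.my_schema.base_cases
where cases is null
```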
There are dbt packages like dbt-expectations that extend the core dbt tests by adding a bunch of test macros, so check it out.
What do you do when you have a lot of great macros that you want to share with others in the community? You create a dbt package of course.
But what is a dbt package? A package is simply a mini dbt project that can be incorporated into your dbt project via the packages.yml file. There are a ton of great packages and the first one you will likely run into is dbt-utils. These are handy utilities that will make your life easier. Trust me, go see all the great things in the dbt-utils package.
Packages don’t just have macros though. Remember, they are mini dbt projects, so some packages incorporate some data transformations to help you do your analytics faster. If you and I both need to analyze the performance of our Google Ads, why should we both have to start from scratch? Well, the fine folks over at Fivetran thought the same thing and created a Google Ads package to help.
When you run the command dbt deps, dbt will look at your packages.yml file and download the specified packages to the dbt_packages directory of your dbt project. If you are on an older version of dbt, packages will be downloaded to the dbt_modules directory instead, but again you can see how this could be confusing hence the updated directory name.
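A minimal packages.yml looks like this; the version shown is a placeholder, so check dbt hub for the latest release:

```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1  # placeholder; use the latest compatible release
```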
There are many packages and new ones arrive regularly. You can see a full listing on dbt hub.
This is the website maintained by dbt Labs with a listing of dbt packages.
As a side note, we at Datacoves also maintain a similar listing of Python libraries that enhance the dbt experience in our dbt Libraries page. Check out all the libraries that exist. From additional database adapters to tools that can extract data from your BI tool and connect it with dbt, there’s a wealth of great open-source projects that take dbt to another level. Keep in mind that you cannot install Python libraries on dbt Cloud.
These are the SQL files you find in the models directory. These files specify how you want to transform your data. By default, each of these files creates a view in the database, but you can change the materialization of a model and, for example, have dbt create a table instead.
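Changing the materialization is a one-line config at the top of the model; the model and the stg_orders reference below are hypothetical:

```sql
{{ config(materialized='table') }}

select
    customer_id,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by 1
```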
Materializations define what dbt will do when it runs your models. Basically, when you execute dbt run, dbt wraps your model's SQL in the DDL for the chosen materialization (for example, create view as or create table as) and executes it in the database.
All the code that dbt compiles and runs can be found in the dbt target directory
This term can be ambiguous to a new dbt user because in dbt we use it to mean two different things. As I used it above, I meant the directory within your dbt project where dbt commands write their output. If you look in this directory, you will see the compiled and run directories where I found the code I showed above.
Now that you know what dbt is doing under the hood, you can look in this directory to see what will be executed in the database. When you need to do some debugging, you should be able to take code directly from the compiled directory and run it on your database.
This is the other meaning for target. It refers to where dbt will create/materialize the objects in your database.
Again, dbt first compiles your model code and creates the files in the compiled directory. It then wraps the compiled code with the specified materialization and saves the resulting code in the run directory. Finally, it executes that code in your database target. It is the final file in the run directory that is executed in your database.
The image above is the code that runs in my Snowflake instance.
But how does dbt know which database target to use? You told it when you set up your dbt profile which is normally stored in a folder called .dbt in your computer's home folder (dbt Cloud and Datacoves both abstract this complexity for you).
When you start using dbt, you learn of a file called profiles.yml. This file has your connection information to the database and should be kept secret, as it typically contains your username and password.
This file is called profiles, plural, because you can have more than one profile, and the profile is where the target database is defined. Here is a case where you could argue that a better name for this file is targets.yml, but you will learn later why the name profiles.yml was probably chosen and why this name makes sense.
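Here is a trimmed-down sketch of such a file, which the next few paragraphs refer to; the connection values are placeholders:

```yaml
default:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: xy12345          # placeholder
      user: my_user
      role: ANALYST
      database: ANALYTICS_DEV
      warehouse: TRANSFORMING
      schema: my_schema
      threads: 8
    prd:
      type: snowflake
      account: xy12345          # placeholder
      user: my_user
      role: ANALYST
      database: ANALYTICS_PRD   # hypothetical production database
      warehouse: TRANSFORMING
      schema: analytics
      threads: 8
```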
Notice above that I have two different dbt targets defined below the word outputs, dev and prd. dbt can only work on one target at a time so if you want to run dbt against two different databases you can specify them here. Just copy the dev target, give it a new name, and change some of the parameters.
Think of the word outputs on line 3 above as targets. Notice in line 2 the line target: dev this tells dbt which target it should use as your default. In my case, unless I specify otherwise, dbt will use the dev target as my default connection. Hence it will replace the Jinja ref macro with my development database.
How would you use the other target? You simply pass the target parameter to the dbt command like
dbt run --target prd or dbt run -t prd
What is that default: thing on the first line of my profiles.yml file?
Well you see, that’s the name given to your dbt profile, which by default is well, default.
The dbt project is what gets created when you run the dbt init command. It includes all of the folders you typically associate with a dbt project and a configuration file called dbt_project.yml. If you look at your dbt_project.yml file, you will find something similar to this.
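Here is a trimmed-down sketch; the file generated by dbt init includes explanatory comments, so the exact line numbers in yours will differ slightly:

```yaml
name: 'my_dbt_project'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'default'

model-paths: ["models"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
target-path: "target"
```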
In line 10 you can see which profile dbt will look for in your profiles.yml file. If I change that line and try to run dbt, I will get an error.
NOTE: For those paying close attention, you may have noticed I used -s and not -m when selecting a specific model to run. This is the new, preferred way to select what dbt will run.
So now you see why profiles.yml is called profiles.yml and not targets.yml, because you can have multiple profiles in the file. In practice, I think people normally only have one profile, but nothing is preventing you from creating more and it might be handy if you have multiple dbt projects each with different connection information.
Those smart folks at Fishtown Analytics built in this flexibility for a very specific use case. You see, they were originally an analytics consulting company and developed dbt to help them do their work more efficiently. You can imagine that they were working with multiple clients whose project timelines overlapped, so by having multiple profiles they could point each independent dbt project to a different profile in the profiles.yml file with each client's database connection information. Something like this:
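This is an abbreviated sketch; each profile would carry that client's full connection settings:

```yaml
company_a:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: client_a_account    # client A's connection details
      database: CLIENT_A_ANALYTICS

company_b:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: client_b_account    # client B's connection details
      database: CLIENT_B_ANALYTICS
```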
Now that I have a profile called company_a in my profiles.yml that matches what I defined in my dbt_project.yml dbt will run correctly.
There is a ton of stuff to learn in your dbt journey and starting out with a solid foundation can help you better communicate and quickly progress through the learning curve.
Fishtown Analytics, now dbt Labs, created dbt to meet a real need they had and some of their shared vocabularies made it into the names we now use in the community. Those of us who have made it past the initial learning curve sometimes forget how daunting all the terminology can be for a newcomer.
There is a wealth of information you can find in the dbt documentation and our own dbt cheat sheet, but it takes some time to get used to all the new terms and understand how it's all connected. So next time you come across a newbie, think about the term that you are about to use and the mental model they will have when you tell them to update the seed. We need to take our new dbt seeds (people) and mature them into strong trees.
In our previous article on the various dbt tests, we talked about the importance of testing data and how dbt, a tool developed by dbt Labs, helps data practitioners validate the integrity of their data. In that article, we covered the various packages in the dbt ecosystem that can be used to run a variety of tests on data. Many people have legacy ETL processes and are unable to move to dbt quickly, but they can still leverage the power of dbt and, by doing so, slowly begin the transition to this tool. In this article, I’ll discuss how you can use dbt to test and document your data even if you are not using dbt for transformation.
Ideally, we can prevent erroneous data from ever reaching our decision makers and this is what dbt was created to do. dbt allows us to embed software engineering best practices into data transformation. It is the “T” in ELT (Extract, Load, and Transform) and it also helps capture documentation, testing, and lineage. Since dbt uses SQL as the transformation language, we can also add governance and collaboration via DataOps, but that’s a topic for another post.
I often talk to people who find dbt very appealing, but they have a lot of investment in existing tools like Talend, Informatica, SSIS, Python, etc. They often have gaps in their processes around documentation and data quality, and while other tools exist, I believe dbt is a good alternative. By leveraging dbt to fill the gaps in your current data processes, you open the door to incrementally moving your transformations to dbt.
Eventually dbt can be fully leveraged as part of the modern data workflow to produce value from data in an agile way. The automated and flexible nature of dbt allows data experts to focus more on exploring data to find insights.
The term ELT can be confusing; some people hear ELT and ETL and think they are fundamentally the same thing. This is muddied by marketers who try to appeal to potential customers by suggesting their tool can do it all. The way I define ELT is by making sure that data is loaded from the source without any filters or transformation. This is EL (Extract and Load). We keep all rows and all columns, and data is replicated even if there is no current need.

While this may seem wasteful at first, it allows Analytics Engineers and Data Engineers to quickly react to business needs. Have you ever faced the need to answer a question only to find that the field you need was never imported into the data warehouse? This is common, especially in traditional thinking where it was costly to store data or when companies had limited resources due to data warehouses that coupled compute with storage. Today, warehouses like Snowflake have removed this constraint, so we can load all the data and keep it synchronized with the sources.

Another aspect of modern EL solutions is making the process to load and synchronize data simple. Tools like Fivetran and Airbyte allow users to easily load data by simply selecting pre-built connectors for a variety of sources and selecting the destination where the data should land. Gone are the days of creating tables in target data warehouses and dealing with changes when sources add or remove columns. The new way of working helps users set it and forget it.
In an environment where other transformation tools are used, you can still leverage dbt to address gaps in testing. There are over 70 pre-built tests that can be leveraged, and custom tests can be created by just using SQL. dbt can test data anywhere in the transformation lifecycle. It can be used at the beginning of the workflow to test or verify assumptions about data sources and the best part is that these data sources or models do not need to be a part of any ongoing project within dbt. Imagine you have a raw customer table you are loading into Snowflake. We can connect this table to dbt by creating a source yml file where we tell dbt where to find the table by providing the name of the database, schema, and table. We can then add the columns to the table and while we are at it, we can add descriptions.
Below is an illustration of how tests could be added for the CUSTOMER table in the TPCH_SF100 schema of the SNOWFLAKE_SAMPLE_DATA database.
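A sketch of that source file; the column names come from Snowflake's sample data, and the tests shown are illustrative:

```yaml
version: 2

sources:
  - name: tpch
    database: SNOWFLAKE_SAMPLE_DATA
    schema: TPCH_SF100
    tables:
      - name: CUSTOMER
        columns:
          - name: C_CUSTKEY
            description: Unique identifier for the customer
            tests:
              - unique
              - not_null
          - name: C_MKTSEGMENT
            description: Market segment the customer belongs to
            tests:
              - not_null
```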
So far we have done what you would learn in a standard dbt tutorial: you start with a source, connect it to dbt, and add some tests. But the reality is, dbt doesn’t really care whether the table we are pointing to is a true "source" table or not. To dbt, any table can be a source, even an aggregation, reporting table, or view. The process is the same: you create a yml file, specify the “source”, and add tests.
Let’s say we have a table that is an aggregate for the number of customers by market segment. We can add a source that points to this table and check for the existence of specific market segments and a range of customers by segment.
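A sketch of that source definition; the database, schema, table, and threshold values are hypothetical, and accepted_range comes from the dbt_utils package:

```yaml
version: 2

sources:
  - name: reporting
    database: ANALYTICS
    schema: REPORTING
    tables:
      - name: CUSTOMERS_BY_SEGMENT
        columns:
          - name: MARKET_SEGMENT
            tests:
              - accepted_values:
                  values: ['AUTOMOBILE', 'BUILDING', 'FURNITURE', 'HOUSEHOLD', 'MACHINERY']
          - name: CUSTOMER_COUNT
            tests:
              - dbt_utils.accepted_range:
                  min_value: 1000
                  max_value: 50000
```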
Using this approach, we can leverage the tests available in dbt anywhere in the data transformation pipeline. We can use dbt_utils.equal_rowcount to validate that two relations have the same number of rows to assure that a transformation step does not inadvertently drop some rows.
When we are aggregating, we can also check that the resulting table has fewer rows than the table we are aggregating by using the dbt_utils.fewer_rows_than test.
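Both are table-level tests that can live in a source definition; in this sketch, CUSTOMER_CLEANED is a hypothetical table produced by a legacy transformation step and the tpch source was registered earlier:

```yaml
version: 2

sources:
  - name: reporting
    database: ANALYTICS
    schema: REPORTING
    tables:
      - name: CUSTOMER_CLEANED
        tests:
          - dbt_utils.equal_rowcount:
              compare_model: source('tpch', 'CUSTOMER')
      - name: CUSTOMERS_BY_SEGMENT
        tests:
          - dbt_utils.fewer_rows_than:
              compare_model: source('tpch', 'CUSTOMER')
```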
Notice that you can use the source macro when referring to another model outside of dbt. As long as you register both models as sources, you can refer to them. So when you see documentation that refers to the ref() macro, just substitute with the source macro as I did above.
Also, note that even though documentation may say this is a model test, you can use this in your source: definition as I have done above.
In dbt sources, we can also add documentation like so:
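```yaml
# Descriptions shown here are illustrative
sources:
  - name: tpch
    description: Snowflake sample data used as a raw source in these examples
    tables:
      - name: CUSTOMER
        description: One record per customer
        columns:
          - name: C_MKTSEGMENT
            description: Market segment the customer belongs to
```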
These descriptions will then show up in the dbt docs.
dbt is a great tool for transforming data and capturing documentation and lineage, but if your company has a lot of transformation scripts built with legacy tools, the migration to dbt may seem daunting and you may think you cannot leverage the benefits of dbt.
By leveraging source definitions you can take advantage of dbt’s ecosystem of tests and ability to document even if transformations are done using other tools.
Gradually, the organization will realize the power of dbt, and you can migrate to dbt over time. For data to be trusted, it needs to be documented and tested, and dbt can help you in this journey.
Don’t let platform limitations or maintenance overhead hold you back.
Datacoves is an enterprise DataOps platform with managed dbt Core and Airflow for data transformation and orchestration, as well as VS Code in the browser for development.