CI/CD With dbt Slim CI: Optimize Using dbt 1.8 --empty Flag
Last Updated:
August 5, 2024
Any experienced data engineer will tell you that efficiency and resource optimization are always top priorities. One powerful feature that can significantly optimize your dbt CI/CD workflow is dbt Slim CI. However, despite its benefits, some limitations have persisted. Fortunately, the recent addition of the --empty
flag in dbt 1.8 addresses these issues. In this article, we will share a GitHub Action Workflow and demonstrate how the new --empty
flag can save you time and resources.
What is dbt Slim CI?
dbt Slim CI is designed to make your continuous integration (CI) process more efficient by running only the models that have been changed and their dependencies, rather than running all models during every CI build. In large projects, this feature can lead to significant savings in both compute resources and time.
Key Benefits of dbt Slim CI
- Speed Up Your Workflows: Slim CI accelerates your CI/CD pipelines by skipping the full execution of all dbt models. Instead, it focuses on only the modified models and their dependencies and uses the defer flag to pull the unmodified models from production. So, if we have model A, B and C yet only make changes to C, then only model C will be run during the CI/CD process.
- Save Time, Snowflake Credits, and Money: By running only the necessary models, Slim CI helps you save valuable build time and Snowflake credits. This selective approach means fewer computational resources are used, leading to cost savings.
dbt Slim CI flags explained
dbt Slim CI is implemented efficiently using these flags:
--select state:modified: The state:modified
selector allows you to choose the models whose "state" has changed (modified) to be included in the run/build. This is done using the state:modified+
selector which tells dbt to run only the models that have been modified and their downstream dependencies.
--state <path to production manifest>: The --state
flag specifies the directory where the artifacts from a previous dbt run are stored ie) the production dbt manifest. By comparing the current branch's manifest with the production manifest, dbt can identify which models have been modified.
--defer: The --defer
flag tells dbt to pull upstream models that have not changed from a different environment (database). Why rebuild something that exists somewhere else? For this to work, dbt will need access to the dbt production manifest.
You may have noticed that there is an additional flag in the command above.
--fail-fast: The --fail-fast
flag is an example of an optimization flag that is not essential to a barebones Slim CI but can provide powerful cost savings. This flag stops the build as soon as an error is encountered instead of allowing dbt to continue building downstream models, therefore reducing wasted builds. To learn more about these arguments you can use have a look at our dbt cheatsheet
dbt Slim CI with Github Actions before dbt 1.8
The following sample Github Actions workflow below is executed when a Pull Request is opened. ie) You have a feature branch that you want to merge into main.
Workflow Steps
Checkout Branch: The workflow begins by checking out the branch associated with the pull request to ensure that the latest code is being used.
Set Secure Directory: This step ensures the repository directory is marked as safe, preventing potential issues with Git operations.
List of Files Changed: This command lists the files changed between the PR branch and the base branch, providing context for the changes and helpful for debugging.
Install dbt Packages: This step installs all required dbt packages, ensuring the environment is set up correctly for the dbt commands that follow.
Create PR Database: This step creates a dedicated database for the PR, isolating the changes and tests from the production environment.
Get Production Manifest: Retrieves the production manifest file, which will be used for deferred runs and governance checks in the following steps.
Run dbt Build in Slim Mode or Run dbt Build Full Run: If a manifest is present in production, dbt will be run in slim mode with deferred models. This build includes only the modified models and their dependencies. If no manifest is present in production we will do a full refresh.
Grant Access to PR Database: Grants the necessary access to the new PR database for end user review.
Generate Docs Combining Production and Branch Catalog: If a dbt test is added to a YAML file, the model will not be run, meaning it will not be present in the PR database. However, governance checks (dbt-checkpoint) will need the model in the database for some checks and if not present this will cause a failure. To solve this, the generate docs step is added to merge the catalog.json from the current branch with the production catalog.json.
Run Governance Checks: Executes governance checks such as SQLFluff and dbt-checkpoint.
Problems with the dbt CI/CD Workflow
As mentioned in the beginning of the article, there is a limitation to this setup. In the existing workflow, governance checks need to run after the dbt build step. This is because dbt-checkpoint relies on the manifest.json and catalog.json. However, if these governance checks fail, it means that the dbt build step will need to run again once the governance issues are fixed. As shown in the diagram below, after running our dbt build, we proceed with governance checks. If these checks fail, we need to resolve the issue and re-trigger the pipeline, leading to another dbt build. This cycle can lead to unnecessary model builds even when leveraging dbt Slim CI.
Leveraging the --empty Flag for Efficient dbt CI/CD Workflows
The solution to this problem is the --empty
flag in dbt 1.8. This flag allows dbt to perform schema-only dry runs without processing large datasets. It's like building the wooden frame of a house—it sets up the structure, including the metadata needed for governance checks, without filling it with data. The framework is there, but the data itself is left out, enabling you to perform governance checks without completing an actual build.
Let’s see how we can rework our Github Action:
Workflow Steps
Checkout Branch: The workflow begins by checking out the branch associated with the pull request to ensure that the latest code is being used.
Set Secure Directory: This step ensures the repository directory is marked as safe, preventing potential issues with Git operations.
List of Files Changed: This step lists the files changed between the PR branch and the base branch, providing context for the changes and helpful for debugging.
Install dbt Packages: This step installs all required dbt packages, ensuring the environment is set up correctly for the dbt commands that follow.
Create PR Database: This command creates a dedicated database for the PR, isolating the changes and tests from the production environment.
Get Production Manifest: Retrieves the production manifest file, which will be used for deferred runs and governance checks in the following steps.
*NEW* Governance Run of dbt (Slim or Full) with EMPTY Models: If there is a manifest in production, this step runs dbt with empty models using slim mode and using the empty flag. The models will be built in the PR database with no data inside and we can now use the catalog.json to run our governance checks since the models. Since the models are empty and we have everything we need to run our checks, we have saved on compute costs as well as run time.
Generate Docs Combining Production and Branch Catalog: If a dbt test is added to a YAML file, the model will not be run, meaning it will not be present in the PR database. However, governance checks (dbt-checkpoint) will need the model in the database for some checks and if not present this will cause a failure. To solve this, the generate docs step is added to merge the catalog.json from the current branch with the production catalog.json.
Run Governance Checks: Executes governance checks such as SQLFluff and dbt-checkpoint.
Run dbt Build: Runs dbt build using either slim mode or full run after passing governance checks.
Grant Access to PR Database: Grants the necessary access to the new PR database for end user review.
By leveraging the dbt --empty
flag, we can materialize models in the PR database without the computational overhead, as the actual data is left out. We can then use the metadata that was generated during the empty build. If any checks fail, we can repeat the process again but without the worry of wasting any computational resources doing an actual build. The cycle still exists but we have moved our real build outside of this cycle and replaced it with an empty or fake build. Once all governance checks have passed, we can proceed with the real dbt build of the dbt models as seen in the diagram below.
Conclusion
dbt Slim CI is a powerful addition to the dbt toolkit, offering significant benefits in terms of speed, resource savings, and early error detection. However, we still faced an issue of wasted models when it came to failing governance checks. By incorporating dbt 1.8’s --empty
flag into your CI/CD workflows we can reduce wasted model builds to zero, improving the efficiency and reliability of your data engineering processes.
🔗 Watch the vide where Noel explains the --empty
flag implementation in Github Actions: