Datacoves

Explaining big data analytics with baking cakes

Love Desserts? Big Data Analytics & Baking Cakes Analogy

5 mins read

Big Data Analytics - Making information tasty and accessible

No one bakes for the sake of baking alone; cakes are meant to be shared. If no one bought, ate, or gifted someone with their delicious chocolate cake, then there would be no bakers. The same is true for data analytics. Our goal as data practitioners is to feed our organization with the information needed to make decisions.

If our cake doesn’t taste good or isn’t available when people want dessert, then it doesn’t matter that we made it from scratch. When it comes to big data, your goal should be to have the equivalent of a delicious cake - usable data - available when someone needs it.

A person eating desert easily accessible — Image from Pexels

Big Data Analytics - The ever-changing tastes of data consumers

Life would be simple if everyone were happy with a single flavor of cake. Metrics play a crucial role in our organization, and two of the most fundamental ones are ARR (Annual Recurring Revenue) and NRR (Net Recurring Revenue). These metrics are like chocolate and vanilla - they remain popular and relevant. Yet, these flavors alone are not enough. Just as with ice cream flavors, we eventually need to try something new, the same goes for insights. It's important to experiment and explore different perspectives.

When something is novel, we love it. When we first start baking, we are not consistent. But, over time, quality improves. We go from ok, to good, and to great. Even with our newfound expertise, the chocolate cake will become boring. We all want something new.

With data, you often start with a simple metric. Having something is better than nothing, but that only lasts so long. We discover something new about our business and we translate that information into action. This is good for a while, but we will eventually see diminishing returns. While a LTV (Life Time Value) analysis may have a significant impact today, its usefulness is likely to diminish within a short period of time. Your stakeholders will crave something new, something more innovative and updated.

Someone is going to ask for a deep dive into how CAC (Customer Acquisition Cost) impacts LTV, or they might ask for a new kind of icing on that cake you just made. Either way, the point remains – as those around you start making use (or eating) of what you’re providing, they will inevitably ask for something more.

‍

Variety of desert displayed on a table — Image from Pexels

Big Data Analytics - How much expertise do you need to bake successfully?

It depends. Some organizations can do more than enough with the reports and dashboards built into their CRM, web analytics, or e-commerce system. This works for many companies. There is no need to spend extra time and money on more complex data systems when something simple will suffice. There is a reason that chocolate and vanilla are popular flavors; many people like their taste and know what they are getting with their order. The same can be said for your data infrastructure.

However, flexibility is the challenge. You are limited by your reporting options and the data analytics you can access. You can have a cake you can eat, but it’s going to look a certain way, having only one type of icing, and it certainly will not have any premium fillings.

If you want those things, you need to look for something a bit more nuanced.

‍

Different tools needed in baking — Image from Pexels

Big Data Analytics - Start with an Easy Bake Oven

How do you start to accommodate your organization’s new demand for analysis or your family’s newly refined palate? By using new tools and techniques.

The Easy Bake Oven is the simplest first step – you get a pouch of ingredients, mix them, and within a few minutes, you have a cake. You might think it’s a children’s toy, but there are very creative recipes out there for the adventurous types. Unfortunately, you can only go so far; you’re limited by the size of the oven and the speed you can bake each item.

The corollary in the data world is Excel – a fabulous tool, but something that has its limitations. While it is possible to extract data from your tools and manipulate it in a spreadsheet to make it more manageable, you are still limited by the pre-designed extracts offered by the vendor tools. Excel is flexible enough for many, but it is not a perfect end-state reporting solution for everyone.

Eventually, you’ll need to address inconsistencies between systems and automate the process of data prep before it gets into your spreadsheet.

‍

Big Data Analytics - Using more advanced techniques to elevating your baking

We have a feel for the basics, but now we want to improve our process. There are certain things we do repeatedly, regardless of the recipes we’re making. We need to create an assembly line. Whether we’re calculating metrics or mixing the wet ingredients and the dry ingredients, there are steps we need to complete in a specific order and to a certain level of quality.

We’ve moved on from the Easy Bake Oven and are now baking using a full-size oven. We need to be more careful about our measurements, ensure the oven is at the correct temperature, and write down notes about the various steps in our more complicated recipes. We need to make sure that different cakes, fillings, and decorations are ready at the same time, even if they have wildly different prep times. We need consistent results every time we make the cake, and we need others to be able to make the same recipe and achieve the same results by following our instructions.

Documenting and transferring this knowledge to others is difficult. Sure, you can write things down, but it is easy to skip a step that you take for granted. Perhaps your handwriting is hard to read, or someone is having trouble with the oven. You may be aware that using a hair dryer while baking can cause the circuit breaker to trip, but unless you document this information, others will not know this quirk. If your recipe has changed, such as using individual ingredients instead of a pre-made cake mix, it is important to clearly specify these adjustments, otherwise your friends and family may struggle to replicate your delicious cake.

There are plenty of data analytics tools out there that assist at this stage – Alteryx, Tableau Prep, and Datameer are just a few of many. In large enterprise organizations, you might find Informatica Power Center, Talend, or Matillion. These types of tools have graphical user interfaces (GUI); they give you the flexibility to extract data, load data, inspect data and transform it. Many enable you to define and calculate metrics. But they require you to work within each tool’s set of rules and constraints. This works well if you are starting and need something less complex.

The process that was once simple now is not; there are hidden assumptions, configurations, and requirements. You’re not using a pen-and-paper recipe anymore; now you’re working within a new system.

GUI-based tools are great for companies whose workflows fit the way the systems work. But between the way some tools are licensed, and the skill needed to use them, they are only available to the IT organization. This leaves users to find shortcuts, develop workarounds, and become dependent on the business’ shadow IT. Inevitably, you’re going to run into maintainability issues.

Recipes often have plenty of steps and ingredients; “data recipes” are no different. There are dependencies between operations, different run times for different transformations, different release cadences, and data availability SLAs. Your team might be able to manage your entire workflow quite well with one of these tools, but once your team starts introducing custom SQL logic or additional overlapping tools into the ecosystem, you introduce another layer of complexity. The result is an increase in the total cost of ownership.

Often, this complexity is opaque, too. It is not obvious what that GUI component is doing, but they are strung together to build something usable. Over time, the complexity continues to grow; custom SQL logic is introduced, and more steps in the chain. Eventually, abstractions begin to form. The data engineers decide that these processing pipelines look quite similar and can be customized based on some basic configurations. Less overhead, more output.

You’re now on the path of building custom ELT (Extract, Transform, and Load) pipelines, stitched together within the constraints of a GUI-based system. For some companies, this is okay, and it works. But there is a hidden cost – it is harder to maintain high quality inputs and outputs. The layers are tightly coupled and a mistake in one step is not caught until the whole pipeline is complete.

IT may not be aware of downstream issues because they occur in other tools outside of their domain; one change here breaks something else there. This is like buying a ready-made cake mix and “enhancing it” with your custom ingredients. It works until it does not. One day, Duncan Hines changes their ingredients, and without you realizing, there is a bad reaction between your “enhancements” and the new mix. Your once great recipe is not so great anymore, but there was no way for Duncan Hines to know. They expected you to follow their instructions; everything has been going according to plan until now.

Even if your tool has strong version control built in, it’s often difficult to reverse the changes before it is too late. If your recipe calls for 1 cup of sugar but you accidentally add in 11, we don’t want to wait until the cake is baked to discover the error. We want to catch that mistake as soon as it happens.

‍

Big Data Analytics - Building consistence into the baking process

Everything up to this point serves a specific profile of a business, but what happens when the business matures beyond what these tools offer? What happens when you know how to bake a cake, but struggle to consistently produce hundreds or thousands of the same quality cakes?

When we aim to expand our baking operations, it's crucial to maintain consistency among bakers, minimize accidental mistakes, and have the ability to swiftly recover from any errors that occur. We need enough mixers and ovens to support the demand for our cakes, and we need an organized pantry with the correct measuring cups and spoons. We need to know which ingredients are running low, which are delayed, and which have common allergens.

In data analytics, we have our own supply chain, often called ELT. You will find tools like Airbyte and Fivetran are common choices for bringing in our data “ingredient delivery”. They manage data extraction and ingestion so you can skip the manual CSV downloads that once served you so well.

We want to ensure quality, have traceability, document our process, and successfully produce and deliver our cakes. To do this, we need a repeatable process, with clear sequential steps. In the world of baking, we use recipes to achieve this.

All baking recipes have a series of steps, some of which are common across different recipes. For example, creaming eggs and sugar, combining the wet and dry ingredients, and whipping the icing are repeatable steps that, when performed in the correct order, result in a delicious dessert. Recipes also follow a standard format: preheat the oven to the specified temperature, list the ingredients in the order in which they are used, and provide the preparation steps last. The sequence is intentional and provides a clear understanding of what to expect during the baking process.

We can apply this model to our data infrastructure by using a tool called dbt (data build tool). Instead of repeating miscellaneous transformation steps in various places, we can centralize our transformation logic into reusable components. Then we can reference those components throughout our project. We can also identify which data is stale, review the chain of dependencies between transformations, and capture the documentation alongside that logic.

We no longer need a GUI-driven tool to review our data; instead, we use the process as defined in the code to inform our logic and documentation. Our new teammate can now confidently create her cake to the same standard as everyone else; she can be confident that she is avoiding common allergens, too.

Better yet, we have a history of changes to our process and “data recipe”. Version control and code reviews are an expectation, so we know when modifications to our ingredient list will cause a complete change in the final product. Our recipes are no longer scattered, but part of a structured system of reusable, composable steps.

‍

Multiple cakes being baked — Image from Pexels

Big Data Analytics - Are you ready to bake well, consistently?

Maturing the way we build data processes comes down to our readiness. When we started with our Easy Bake Oven, little could go wrong, but little could be tweaked. As we build a more robust system, we can take advantage of its increased flexibility, but we also need to maintain more pieces and ensure quality throughout a more complex process.

We need to know the difference between baking soda and baking powder. Which utensils are best, and which oven cooks most evenly? How do we best organize our new suite of recipes? How do you set up the kitchen and install the appliances? There are many decisions to make, and not every organization is ready to make them. All this can be daunting for even large organizations.

But you don’t have to do things all at once. Many organizations choose to make gradual improvements, transforming their big data process from disorganized to consistent.

You can subscribe to multiple SaaS services like Fivetran or Airbyte for data loading and use providers like dbt Cloud for dbt development. If your work grows increasingly complex, you can make use of another set of tools (such as Astronomer or Dagster) to orchestrate your end-to-end process. You will still need to develop the end-to-end flow, so what you gained in flexibility you have lost in simplicity.

‍

Professional baking a cake — Image from Pexels

Big Data Analytics - Becoming a master baker quickly

This is what we focus on at Datacoves. We aim to help organizations create mature processes, even when they have neither the time nor resources to figure everything out.

We give you “a fully stocked kitchen” - all the appliances, recipes, and best practices to make them work cohesively. You can take the guess work out of your data infrastructure, and instead, use a suite of tools designed to help your team perform timely and efficient analytics.

Whether your company is early in its data analytics journey or ready to take your processes to the next level, we are here to help. If your organization has strong info-sec or compliance requirements, we can also deploy within your private cloud. Datacoves is designed to get you “baking delicious cakes” as soon as possible.

Set aside some time to speak with me and learn how Datacoves has helped both small and large companies deploy mature analytics processes from the start. Also check out our case studies to see some of our customer's journeys.

Imagine yourself baking the next great cake at your organization. You can do it quickly with our help.

‍

Master Bakers making deserts — Image from Pexels

‍

Transform Data

Getting Started with dbt: Demystifying Key Terms

5 mins read

When you are learning to use a new tool or technology, one of the hardest things is learning all the new terminology. As we pick up language throughout our lives, we develop an association between words and our mental model of what they represent. The next time we see the word again that picture pops up in our head and if the word is now being used to mean something new, we must create a new mental model. . In this post, we introduce some core dbt (data building tool)terminology and how it all connects.

Language understanding is interesting in that once we have a mental model of a term, we have a hard time grasping the new association. I still remember the first time I spoke to someone about the Snowflake Data Warehouse, and they used the term warehouse. To me, the term had two mental models. One was a place where we store a lot of physical goods, type Costco Warehouse into Google and the first result is Costco Wholesale, a large retailer in the US that is so big it is literally a warehouse full of goods.

I have also worked in manufacturing, so I also associated a warehouse as the place where raw materials and finished goods are stored.

In programming, we would say we are overloading the term warehouse to mean different things.

In some programming languages, function overloading or method overloading is the ability to create multiple functions of the same name with different implementations – Wikipedia

We do this type of thing all the time and don’t think twice about it. However, if I say “I need a bass” do you know what I am talking about?

In my Snowflake example, I knew the context was technology and more specifically something to do with databases, so I already had a mental model for a warehouse. It’s even in Wikipedia's description of the company.

Snowflake Inc. is a cloud computing-based data warehousing company - Wikipedia

I knew of data warehouses from Teradata and Amazon (Redshift), so it was natural for me to think of a warehouse as a technology and a place where lots of data is stored. In my mind, I quickly thought of

The Redshift Warehouse
The Teradata Warehouse
The Snowflake Warehouse

For those new to the term warehouse, I may have lost you already. Maybe you are new to dbt and you come from the world of tools like Microsoft Excel, Alterix, Tableau, and PowerBI. If you know all this, grant me a few minutes to bring everyone up to speed.

Let’s step back and first define a database.

A database is an organized collection of structured information, or data, typically stored electronically in a computer system - Oracle

Nick Carter on Twitter: "@SpeedwaySam the first? Wha wha whaaat?? Let's not make it the last. Thank you for being a loyal fan. http://t.co/lSfJG7MB7P" / Twitter — What What What?

Ok, you probably know Excel. You have probably also seen an Excel Workbook with many sheets. If you organize your data neatly in Excel like the image below, we could consider that workbook a database.

Going back to the definition above “organized collection of structured information” you can see that we have structured information, a list of orders with a Date, Order Quantity, and Order Amount. We also have a collection of these, namely Orders and Invoices.

In database terms, we call each Excel sheet a table and each of the columns an attribute.

Now back to a warehouse. This was my mental model of a warehouse.

A data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise - Wikipedia

Again, if you are new to all this jargon, the above definition might not make much sense to you. Going back to our Excel Example. In an organization, you have many people with their own “databases” like the example above. Jane has one, Mario has another, Elena has a third. All have some valuable information we want to combine in order to make better decisions. So instead of keeping these Excel workbooks separately, we put them all together into a database and now we call that a warehouse. We use this central repository for our “business intelligence”

So, knowing all of this, when I heard of a Snowflake warehouse the above is what I thought. It is the place where we have all the data, duh. Just like Redshift and Teradata. But look at what the people at Snowflake did, they changed the meaning on me.

A virtual warehouse, often referred to simply as a “warehouse”, is a cluster of compute resources in Snowflake. - Snowflake

The term warehouse here is no longer about the storage of things it now means “cluster of compute” A what of what?

Ok, let’s break this down. You are probably reading this on a laptop or some other mobile device. That device stores all your documents and when you perform some actions it “computes” certain things. Well, in Snowflake the storage of the information is separate and independent of the computation on the things that are stored. So, you can store things once and connect different “computers” to it. Imagine you were performing a task on your laptop, and it was slow. What if you could reach in your desk drawer, pull out a faster computer, and speed up the task that was slow, well, in Snowflake you can. Also, instead of just having one computer doing the work, they have a cluster of computers working together to get the job done even faster.

As you can see, language is tricky, and creating a shared understanding of it is crucial to advancing your understanding and mastery of the technology. Every Snowflake user develops the new mental model for a warehouse and using it is second nature, but we forget that these terms that are now natural to us may still be confusing to newcomers.

Understanding dbt (data build tool) terminology

Let’s start with dbt. When you join the dbt Slack community you will inevitably learn that the preferred way to write dbt is all lower case. Not DBT, not Dbt, just dbt. I still don’t know why exactly, but you may have noticed that everyone in this space always puts “dbt (Data Build Tool)”

If you have some knowledge of Behavioral Therapy you may already know that DBT has a different meaning. Dialectical behavior therapy (DBT)

Dialectical behavioral therapy (DBT) is a type of cognitive-behavioral therapy. Cognitive-behavioral therapy tries to identify and change negative thinking patterns and pushes for positive behavioral changes.

Did you notice how they do the inverse? They spell out Dialectical behavior therapy and put DBT in parenthesis. So, maybe the folks at Fishtown Analytics, now dbt Labs came across this other meaning for DBT and chose to differentiate by using lowercase, or maybe it was to mess with all of the newbies lol.

So update your auto-correct and don’t let dbt become DBT or Dbt or you will hear from someone in the community, haha.

Now let’s do a quick rundown of terms you will hear in dbt land which may confuse you as you start your dbt journey. I will link to the documentation with more information. My job here is to hopefully create a good mental model for you, not to teach you all the ins and outs of all of these things.

Seed or dbt seed

This is simply some data that you put into a file and make it part of your project. You put it in the seeds folder within your dbt project, but don’t use this as your source to populate your data warehouse, these are typically small files you may use as lookup tables. If you are using an older version of dbt, the folder would be data instead of seeds. That was another source of confusion, so now the term seed and the directory seed are more tightly connected. The format of these files must be CSV, more information can be found via the link above.

Jinja

Jinja is a templating engine with syntax similar to the Python programming language that allows you to use special placeholders in your SQL code to make it dynamic. The stuff you see with {{ }} is Jinja.

Without Jinja, there is no dbt. I mean it is the combination of Jinja with SQL that gives us the power to do things that would otherwise be very difficult. So, when you see the lineage you get in the dbt documentation, you can thank Jinja for that.

Lineage graph generated by dbt leveraging the source and ref macros — Lineage graph generated by dbt leveraging the **source** and **ref** macros

dbt macro

I knew you would have this question. Well, a macro is simply a reusable piece of code. This too adds to the power of dbt. Every newcomer to dbt will quickly learn about the ref and source macros. These are the cornerstone of dbt. They help capture the relationship and sequence of all your data transformations. Sometimes you are using macros and you may not even realize it. Like the not_null test in your yml file, that’s a macro.

Behind the scenes, dbt is taking information in your yml file and sending parameters to this macro. In my example, the parameter model gets replaced with base_cases (along with the database name and schema name) and colum_name gets replaced with cases. The compiled version of this test looks like this:

There are dbt packages like dbt-expectations that extend the core dbt tests by adding a bunch of test macros, so check it out.

dbt package

What do you do when you have a lot of great macros that you want to share with others in the community? You create a dbt package of course.

But what is a dbt package? A package is simply a mini dbt project that can be incorporated into your dbt project via the packages.yml file. There are a ton of great packages and the first one you will likely run into is dbt-utils. These are handy utilities that will make your life easier. Trust me, go see all the great things in the dbt-utils package.

Packages don’t just have macros though. Remember, they are mini dbt projects, so some packages incorporate some data transformations to help you do your analytics faster. If you and I both need to analyze the performance of our Google Ads, why should we both have to start from scratch? Well, the fine folks over at Fivetran thought the same thing and created a Google Ads package to help.

When you run the command dbt deps, dbt will look at your packages.yml file and download the specified packages to the dbt_packages directory of your dbt project. If you are on an older version of dbt, packages will be downloaded to the dbt_modules directory instead, but again you can see how this could be confusing hence the updated directory name.

There are many packages and new ones arrive regularly. You can see a full listing on dbt hub.

dbt hub

This is the website maintained by dbt Labs with a listing of dbt packages.

As a side note, we at Datacoves also maintain a similar listing of Python libraries that enhance the dbt experience in our dbt Libraries page. Check out all the libraries that exist. From additional database adapters to tools that can extract data from your BI tool and connect it with dbt, there’s a wealth of great open-source projects that take dbt to another level. Keep in mind that you cannot install Python libraries on dbt Cloud.

dbt models

These are the SQL files you find in the models directory. These files specify how you want to transform your data. By default, each of these files creates a view in the database, but you can change the materialization of a model to something else and for example, have dbt create a table instead.

Materialization

Materializations define what dbt will do when it runs your models. Basically, when you execute dbt run this is what happens.

dbt reads all your files
dbt then compiles the models by replacing the jinja code with the “real” code the database will run e.g. {{ ref(“my_model”) }} becomes my_database.my_schema.my_model
Finally, it wraps the compiled code in the specified materialization, which by default is a view

Compiled model dbt produces. Notice how line 3 was changed to a specific database object

All the code that dbt compiles and runs can be found in the dbt target directory

Target

This term can be ambiguous to a new dbt user. This is because in dbt we use it interchangeably to mean two different things. As I used it above, I meant the directory within your dbt project where dbt commands write their output. If you look in this directory, you will see the compiled and run directories where I found the code I showed above.

Now that you know what dbt is doing under the hood, you can look in this directory to see what will be executed in the database. When you need to do some debugging, you should be able to take code directly from the compiled directory and run it on your database.

dbt target

This is the other meaning for target. It refers to where dbt will create/materialize the objects in your database.

Again, dbt first compiles your model code and creates the files in the compiled directory. It then wraps the compiled code with the specified materialization and saves the resulting code in the run directory. Finally, it executes that code in your database target. It is the final file in the run directory that is executed in your database.

Code in the run directory is sent to your database — Code in the **run** directory is sent to your database

The image above is the code that runs in my Snowflake instance.

But how does dbt know which database target to use? You told it when you set up your dbt profile which is normally stored in a folder called .dbt in your computer's home folder (dbt Cloud and Datacoves both abstract this complexity for you).

dbt Profile

When you start using dbt, you learn of a file called profiles.yml This file has your connection information to the database and should be kept secret as it typically contains your username and password.

This file is called profiles, plural, because you can have more than one profile which you eventually realize is where the target database is defined. Here is a case where you can argue that a better name for this file is targets.yml, but you will learn later why the name profiles.yml was probably chosen and why this name makes sense.

Two targets defined in profiles.yml (database connection details collapsed for brevity)

Notice above that I have two different dbt targets defined below the word outputs, dev and prd. dbt can only work on one target at a time so if you want to run dbt against two different databases you can specify them here. Just copy the dev target, give it a new name, and change some of the parameters.

Think of the word outputs on line 3 above as targets. Notice in line 2 the line target: dev this tells dbt which target it should use as your default. In my case, unless I specify otherwise, dbt will use the dev target as my default connection. Hence it will replace the Jinja ref macro with my development database.

Line 3 shows what the ref macro gets replaced with using the default target in the profiles.yml file when compiling this model

How would you use the other target? You simply pass the target parameter to the dbt command like

dbt run --target prd or dbt run -t prd

What is that default: thing on the first line of my profiles.yml file?

My profiles.yml starts with the word default

Well you see, that’s the name given to your dbt profile, which by default is well, default.

dbt project

The dbt project is what is created when you create a project via the dbt init command. It includes all of the folders you typically associate with a dbt project and includes a configuration file called dbt_project.yml. If you look at your dbt_project.yml file, you will find something similar to this.

Line 10 shows which profile dbt will use from within your profiles.yml file

In line 10 you can see which profile dbt will look for in your profiles.yml file. If I change that line and try to run dbt, I will get an error.

New profile name that does not match what is in my profiles.yml file

dbt run fails because it didn't find the company a profile in my profiles.yml file

NOTE: For those paying close attention, you may have seen I used-s and not -m when selecting a specific model to run. This is the new/preferred way to select what dbt will run.

So now you see why profiles.yml is called profiles.yml and not targets.yml, because you can have multiple profiles in the file. In practice, I think people normally only have one profile, but nothing is preventing you from creating more and it might be handy if you have multiple dbt projects each with different connection information.

Those smart folks at Fishtown Analytics build in this flexibility for a very specific use case. You see, they were originally an analytics consulting company and developed dbt to help them do their work more efficiently. You can imagine that they were working with multiple clients whose project timelines overlapped so by having multiple profiles they could point each independent dbt project to a different profile in the profiles.yml file with each client's database connection information. Something like this.

profiles.yml with three profiles; default, company_a, and company_b

Now that I have a profile called company_a in my profiles.yml that matches what I defined in my dbt_project.yml dbt will run correctly.

dbt_project.yml pointing to a profile called company_a

dbt run can now find a profile named company_a so it knows what database connection to use

Conclusion

There is a ton of stuff to learn in your dbt journey and starting out with a solid foundation can help you better communicate and quickly progress through the learning curve.

Fishtown Analytics, now dbt Labs, created dbt to meet a real need they had and some of their shared vocabularies made it into the names we now use in the community. Those of us who have made it past the initial learning curve sometimes forget how daunting all the terminology can be for a newcomer.

There is a wealth of information you can find in the dbt documentation and our own dbt cheat sheet, but it takes some time to get used to all the new terms and understand how it's all connected. So next time you come across a newbie, think about the term that you are about to use and the mental model they will have when you tell them to update the seed. We need to take our new dbt seeds (people) and mature them into strong trees.

‍

Comparing cooking to data solutions you can trust

Digital Transformation

You don’t need to build data lake, you need Omakase

5 mins read

In 3 Core Pillars to a Data-Driven Culture, I discussed the reasons why decision makers don’t trust analytics. I then outlined the alignment and change management aspect to any solution. Once you know what you want, how do you deliver it? The cloud revolution has brought in a new set of challenges for organizations which have nothing to do with delivering solutions. The main problem is that people are faced with a Cheesecake Factory menu and most people would be better served with Omakase.

For those who may not be aware, The Cheesecake Factory menu has 23 pages and over 250 items to choose from. There are obviously people who want the variety and there is certainly nothing wrong with that, but my best meals have been where I have left the decision to the chef.

Omakase, in a Japanese restaurant is a meal consisting of dishes selected by the chef, it literally means “I'll leave it up to you”

How does this relate to the analytics landscape? Well, there is a gold rush in the analytics space. There is a lot of investment and there are literally hundreds of tools to choose from. I have been following this development over the last five years and if anything, the introduction of tools has accelerated.

This eye chart represents the ever growing list of analytics tools

Most people are where I was back in 2016. While I have been doing work in this space for many years the cloud and big data space was all new to me. There was a lot I needed to learn and I was always questioning whether I was making the right decision. I know many people today who do POC after POC to see which tool will work the best, I know, I did the same thing.

Contrast this process with my experience learning a web development framework called Ruby on Rails. When I started learning Rails in 2009 I was focused on what I was trying to build, not the set of tools and libraries that are needed to create a modern web application. That’s because Rails is Omakase.

When you select Omakase in Rails you are trusting many people with years of experience and training to share that knowledge with you. Not only does this help you get going faster, but it also brings you into a community of like-minded people. So that when you run into problems, there are people ready to help. Below I present my opinionated view of a three-course meal data stack that can serve most people and the rationale behind it. This solution may not be perfect for everyone, but neither is Rails.

Appetizer: Loading data

You are hungry to get going and start doing analysis, but we need to start off slowly. You want to get the data, but where do you start. Well, there are a few things to consider.

· Where is the data coming from?

· Is it structured into columns and rows or is it semi-structured(JSON)?

· Is it coming in at high velocity?

· How much data are you expecting?

What I find is that many people want to over engineer a solution or focus on optimizing for one dimension which is usually cost since that is simple to grasp. The problem is that if you focus only on cost, you are giving up something else, usually a better user experience. You don’t have a lot of time to evaluate solutions and build extract and load scripts, so let me make this simple. If you start with Snowflake as your database and Fivetran as your Extract and Load solution, you’ll be fine. Yes, there are reasons why not to choose those solutions, but you probably don’t need to worry about them, especially if you are starting out and you are not Apple.

Why Snowflake you ask? Well, I have used Redshift, MS SQLServer, Databricks, Hadoop, Teradata, and others, but when I started using Snowflake I felt like a weight was lifted. It “just worked.” Do you think you will need to mask some data at some point? They have dynamic data masking. Do you want to be able to scale compute or storage independently? They have separate compute and storage too. Do you like waiting for data vendors to extract data from their system and then having to import it on your side? Or do you need to collaborate with partners and send them data? Well,Snowflake has a way for companies to share data securely, gone are the days of moving data around, now you can securely grant access to groups within or outside your organization, simple, elegant. What about enriching your data with external data sources? Well, they have a data marketplace too and this is bound to grow. Security is well thought out too and you can tell they are focused on the user experience because they do things to improve analyst happiness like MATCH_RECOGNIZE. Oh, and they also handle structured and semi-structured data amazingly well and all without having to tweak endless knobs. With one solution I have been able to eliminate the need to answer the questions above because Snowflake can very likely handle your use case regardless of the answer. I can go on and on, but trust me, you’ll be satisfied with your Snowflake appetizer. If it’s good enough for Warren Buffett, it’s good enough for me.

But what about Fivetran you say? Well, because you have better things to do than to replicate data from Google Analytics, Salesforce, Square, Concur, Workday, Google Ads, etc. etc. Here’s the full list of current connectors Fivetran supports. Just set it and forget it. No one will give you a metal for mapping data from standard data sources to Snowflake. So just do the simple thing and let’s get to the main dish.

Finish your data appetizer and get the to the main dish.

Main dish: Transforming data

Now that we have all our data sources in Snowflake, what do we do? Well, I haven’t met anyone who doesn’t want to do some level of data quality, documentation, lineage for impact analysis, and do this in a collaborative way that builds trust in the process.

I’ve got you covered. Just use dbt. Yup, that’s it, simple, a single tool that can do documentation, lineage, data quality, and more. dbt is a key component in our DataOps process because it, like Snowflake, just works. It was developed by people who were analysts themselves and appreciated software development best practices like DRY. They knew that SQL is the great common denominator and all it needed was some tooling around it. It’s hard enough finding good analytics engineers let alone finding ones that know Python. Leave the Python to Data Science and first build a solid foundation for your transformation process. Don’t worry, I didn’t forget about your ambition to create great machine learning models, Snowflake has you covered there as well, check out Snowpark.

You will need a little more than dbt in order to schedule your runs and bring some order to what otherwise would become chaos, but dbt will get you a long way there and if you want to know how we solve this with our Datacoves, reach out, we’ll share our knowledge in our 1-hour free consultation.

A great meal starts with great ingredients.

Dessert: Reporting on data

This three-course meal is quickly coming to an end, but I couldn’t let you go home before you have dessert. You need dashboards, but you also want self-service, then you can’t go wrong with Looker. I am not the only chef saying this, have a look at this.

One big reason for choosing Looker in addition to the above is the fact that version control is part of the process. If you want things that are documented, reused, and follow software development best practices, then you need to have everything in version control. You can no longer depend on the secret recipe that one of your colleagues has on their laptops. People get promoted, move to other companies, forget… and you need to have a data stack that is not brittle. So choose your dessert wisely.

Conclusion

There are a lot of decisions to be made when creating a great meal. You need to know your guests dietary needs, what you have available, and how to turn raw ingredients into a delicious plate. When it comes to data the options and permutations are endless and most people need to get to delivering solutions so decision makers can improve business results. While no solution is perfect, in my experience there are certain ingredients that when put together well enable users to get to building quickly. If you want to deliver analytics your decision makers can trust, just go Omakase.

Transform Data

Using dbt: Documenting & Testing Data from Various Tools

5 mins read

In our previous article we wrote about the various dbt tests, we talked about the importance of testing data and how dbt, a tool developed by dbt Labs, helps data practitioners validate the integrity of their data. In that article we covered the various packages in the dbt ecosystem that can be used to run a variety of tests on data. Many people have legacy ETL processes and are unable to make the move to dbt quickly, but they can still leverage the power of dbt and by doing so slowly begin the transition to this tool. In this article, I’ll discuss how you can use dbt to test and document your data even if you are not using dbt for transformation.

Why dbt?

Ideally, we can prevent erroneous data from ever reaching our decision makers and this is what dbt was created to do. dbt allows us to embed software engineering best practices into data transformation. It is the “T” in ELT (Extract, Load, and Transform) and it also helps capture documentation, testing, and lineage. Since dbt uses SQL as the transformation language, we can also add governance and collaboration via DataOps, but that’s a topic for another post.

I often talk to people who find dbt very appealing, but they have a lot of investment in existing tools like Talend, Informatica, SSIS, Python, etc. They often have gaps in their processes around documentation and data quality and while other tools exist, I believe dbt is a good alternative and by leveraging dbt to fill the gaps in your current data processes, you open the door to incrementally moving your transformations to dbt,

Eventually dbt can be fully leveraged as part of the modern data workflow to produce value from data in an agile way. The automated and flexible nature of dbt allows data experts to focus more on exploring data to find insights.

Why ELT?

The term ELT can be confusing, some people hear ELT and ETL and think they are fundamentally the same thing. This is muddied by marketers who try to appeal to potential customers by suggesting their tool can do it all. The way I define ELT is by making sure that data is loaded from the source without any filters or transformation. This is EL (Extract and Load). We keep all rows and all columns. Data is replicated even if there is no current need. While this may seem wasteful at first, it allows Analytic and Data Engineers to quickly react to business needs. Have you ever faced the need to answer a question only to find that the field you need was never imported into the data warehouse? This is common especially in traditional thinking where it was costly to store data or when companies had limited resources due to data warehouses that coupled compute with storage. Today, warehouses like Snowflake have removed this constraint so we can load all the data and keep it synchronized with the sources. Another aspect of modern EL solutions is making the process to load and synchronize data simple. Tools like Fivetran and Airbyte allow users to easily load data by simply selecting pre-build connectors for a variety of sources and selecting the destination where the data should land. Gone are the days of creating tables in target data warehouses and dealing with changes when sources add or remove columns. The new way of working is helping users set it and forget it.

Graphical user interface, text, application, email, websiteDescription automatically generated — This is an example of a modern data flow. **Data Loaders** are the tools that do the extracting and loading process to get the data to the **RAW** area of the data warehouse. These tools include Stitch, Fivetran and Airbyte. Now that the data is in the warehouse dbt can be leveraged for the transformation. As you can see above dbt delivers transformed data and also enables snapshotting, testing, documenting, and facilitates deploying.

Want more flexibility? Migrate your dbt Cloud project in under an hour.

Book a call

Plugging in dbt for testing

In an environment where other transformation tools are used, you can still leverage dbt to address gaps in testing. There are over 70 pre-built tests that can be leveraged, and custom tests can be created by just using SQL. dbt can test data anywhere in the transformation lifecycle. It can be used at the beginning of the workflow to test or verify assumptions about data sources and the best part is that these data sources or models do not need to be a part of any ongoing project within dbt. Imagine you have a raw customer table you are loading into Snowflake. We can connect this table to dbt by creating a source yml file where we tell dbt where to find the table by providing the name of the database, schema, and table. We can then add the columns to the table and while we are at it, we can add descriptions.

The image below illustrates how test would be added for a CUSTOMER table in the SNOWFLAKE_SAMPLE_DATA database in the TPCH_SF100 schema.‍

Graphical user interface, text, application, emailDescription automatically generated

‍

We can do tests at the table level. Here we check that the table has between 1 and 10 columns.

‍

A picture containing graphical user interfaceDescription automatically generated — We can also do tests at the column level. In the image above we assure that C_CUSTKEY columns has no duplicates by leveraging dbt’s **unique** test and we check that the column is always populated with the **not_null** test.

Testing non-source tables

So far we have done what you would learn on a standard dbt tutorial, you start with some source, connect it to dbt, and add some tests. But the reality is, dbt doesn’t really care if the table that we are pointing to is a true "source" table or not. To dbt, any table can be a source, even an aggregation, reporting table, or view. The process is the same. You create a yml file, specify the “source” and add tests.

Let’s say we have a table that is an aggregate for the number of customers by market segment. We can add a source that points to this table and check for the existence of specific market segments and a range of customers by segment.

Graphical user interface, textDescription automatically generated

Using this approach, we can leverage the tests available in dbt anywhere in the data transformation pipeline. We can use dbt_utils.equal_rowcount to validate that two relations have the same number of rows to assure that a transformation step does not inadvertently drop some rows.

When we are aggregating, we can also check that the resulting table has fewer rows than the table we are aggregating by using the dbt_utils.fewer_rows_than test.

TextDescription automatically generated with medium confidence

Notice that you can use the source macro when referring to another model outside of dbt. As long as you register both models as sources, you can refer to them. So when you see documentation that refers to the ref() macro, just substitute with the source macro as I did above.

Graphical user interface, text, application, chat or text messageDescription automatically generated

Also, note that even though documentation may say this is a model test, you can use this in your source: definition as I have done above.

Documenting tables

In dbt sources, we can also add documentation like so:

Text, applicationDescription automatically generated

These descriptions will then show up in the dbt docs.

Graphical user interface, application, TeamsDescription automatically generated — By only having sources in dbt docs, you will not have the lineage capability of dbt, but the above is more than many people have.

Conclusion

dbt is a great tool for transforming data, capturing documentation, and lineage, but if your company has a lot of transformation scripts using legacy tools, the migration to dbt may seem daunting and you may think you cannot leverage the benefits of dbt.

By leveraging source definitions you can take advantage of dbt’s ecosystem of tests and ability to document even if transformations are done using other tools.

Gradually the organization will realize the power of dbt and you can gradually migrate to dbt. For the data to be trusted, it needs to be documented and tested and dbt can help you in this journey.

Noel Gomez

Love Desserts? Big Data Analytics & Baking Cakes Analogy

Big Data Analytics - Making information tasty and accessible

Big Data Analytics - The ever-changing tastes of data consumers

Big Data Analytics - How much expertise do you need to bake successfully?

Big Data Analytics - Start with an Easy Bake Oven

Big Data Analytics - Using more advanced techniques to elevating your baking

Big Data Analytics - Building consistence into the baking process

Big Data Analytics - Are you ready to bake well, consistently?

Big Data Analytics - Becoming a master baker quickly

Getting Started with dbt: Demystifying Key Terms

Understanding dbt (data build tool) terminology

Seed or dbt seed

Jinja

dbt macro

dbt package

dbt hub

dbt models

Materialization

Target

dbt target

dbt Profile

dbt project

Conclusion

You don’t need to build data lake, you need Omakase

Appetizer: Loading data

Main dish: Transforming data

Dessert: Reporting on data

Conclusion

Using dbt: Documenting & Testing Data from Various Tools

Why dbt?

Why ELT?

Want more flexibility? Migrate your dbt Cloud project in under an hour.

Plugging in dbt for testing

Testing non-source tables

Documenting tables

Conclusion

Get our free ebook dbt Cloud vs dbt Core