Comparing cooking to data solutions you can trust

You don’t need to build data lake, you need Omakase

5 mins read

In 3 Core Pillars to a Data-Driven Culture, I discussed the reasons why decision makers don’t trust analytics. I then outlined the alignment and change management aspect to any solution. Once you know what you want, how do you deliver it? The cloud revolution has brought in a new set of challenges for organizations which have nothing to do with delivering solutions. The main problem is that people are faced with a Cheesecake Factory menu and most people would be better served with Omakase.

For those who may not be aware, The Cheesecake Factory menu has 23 pages and over 250 items to choose from. There are obviously people who want the variety and there is certainly nothing wrong with that, but my best meals have been where I have left the decision to the chef.

Omakase, in a Japanese restaurant is a meal consisting of dishes selected by the chef, it literally means “I'll leave it up to you”

How does this relate to the analytics landscape? Well, there is a gold rush in the analytics space. There is a lot of investment and there are literally hundreds of tools to choose from. I have been following this development over the last five years and if anything, the introduction of tools has accelerated.

This eye chart represents the ever growing list of analytics tools

Most people are where I was back in 2016. While I have been doing work in this space for many years the cloud and big data space was all new to me. There was a lot I needed to learn and I was always questioning whether I was making the right decision. I know many people today who do POC after POC to see which tool will work the best, I know, I did the same thing.

Contrast this process with my experience learning a web development framework called Ruby on Rails. When I started learning Rails in 2009 I was focused on what I was trying to build, not the set of tools and libraries that are needed to create a modern web application. That’s because Rails is Omakase.

When you select Omakase in Rails you are trusting many people with years of experience and training to share that knowledge with you. Not only does this help you get going faster, but it also brings you into a community of like-minded people. So that when you run into problems, there are people ready to help. Below I present my opinionated view of a three-course meal data stack that can serve most people and the rationale behind it. This solution may not be perfect for everyone, but neither is Rails.

Appetizer: Loading data

You are hungry to get going and start doing analysis, but we need to start off slowly. You want to get the data, but where do you start. Well, there are a few things to consider.

· Where is the data coming from?

· Is it structured into columns and rows or is it semi-structured(JSON)?

· Is it coming in at high velocity?

· How much data are you expecting?

What I find is that many people want to over engineer a solution or focus on optimizing for one dimension which is usually cost since that is simple to grasp. The problem is that if you focus only on cost, you are giving up something else, usually a better user experience. You don’t have a lot of time to evaluate solutions and build extract and load scripts, so let me make this simple. If you start with Snowflake as your database and Fivetran as your Extract and Load solution, you’ll be fine. Yes, there are reasons why not to choose those solutions, but you probably don’t need to worry about them, especially if you are starting out and you are not Apple.

Why Snowflake you ask? Well, I have used Redshift, MS SQLServer, Databricks, Hadoop, Teradata, and others, but when I started using Snowflake I felt like a weight was lifted. It “just worked.” Do you think you will need to mask some data at some point? They have dynamic data masking. Do you want to be able to scale compute or storage independently? They have separate compute and storage too. Do you like waiting for data vendors to extract data from their system and then having to import it on your side? Or do you need to collaborate with partners and send them data? Well,Snowflake has a way for companies to share data securely, gone are the days of moving data around, now you can securely grant access to groups within or outside your organization, simple, elegant. What about enriching your data with external data sources? Well, they have a data marketplace too and this is bound to grow. Security is well thought out too and you can tell they are focused on the user experience because they do things to improve analyst happiness like MATCH_RECOGNIZE. Oh, and they also handle structured and semi-structured data amazingly well and all without having to tweak endless knobs. With one solution I have been able to eliminate the need to answer the questions above because Snowflake can very likely handle your use case regardless of the answer. I can go on and on, but trust me, you’ll be satisfied with your Snowflake appetizer. If it’s good enough for Warren Buffett, it’s good enough for me.

But what about Fivetran you say? Well, because you have better things to do than to replicate data from Google Analytics, Salesforce, Square, Concur, Workday, Google Ads, etc. etc. Here’s the full list of current connectors Fivetran supports. Just set it and forget it. No one will give you a metal for mapping data from standard data sources to Snowflake. So just do the simple thing and let’s get to the main dish.

Do the simple thing and let’s get to the main dish

Main dish: Transforming data

Now that we have all our data sources in Snowflake, what do we do? Well, I haven’t met anyone who doesn’t want to do some level of data quality, documentation, lineage for impact analysis, and do this in a collaborative way that builds trust in the process.

I’ve got you covered. Just use dbt. Yup, that’s it, simple, a single tool that can do documentation, lineage, data quality, and more. dbt is a key component in our DataOps process because it, like Snowflake, just works. It was developed by people who were analysts themselves and appreciated software development best practices like DRY. They knew that SQL is the great common denominator and all it needed was some tooling around it. It’s hard enough finding good analytics engineers let alone finding ones that know Python. Leave the Python to Data Science and first build a solid foundation for your transformation process. Don’t worry, I didn’t forget about your ambition to create great machine learning models, Snowflake has you covered there as well, check out Snowpark.

You will need a little more than dbt in order to schedule your runs and bring some order to what otherwise would become chaos, but dbt will get you a long way there and if you want to know how we solve this with our Datacoves, reach out, we’ll share our knowledge in our 1-hour free consultation.

Dessert: Reporting on data

This three-course meal is quickly coming to an end, but I couldn’t let you go home before you have dessert. You need dashboards, but you also want self-service, then you can’t go wrong with Looker. I am not the only chef saying this, have a look at this.

One big reason for choosing Looker in addition to the above is the fact that version control is part of the process. If you want things that are documented, reused, and follow software development best practices, then you need to have everything in version control. You can no longer depend on the secret recipe that one of your colleagues has on their laptops. People get promoted, move to other companies, forget… and you need to have a data stack that is not brittle. So choose your dessert wisely.

Conclusion

There are a lot of decisions to be made when creating a great meal. You need to know your guests dietary needs, what you have available, and how to turn raw ingredients into a delicious plate. When it comes to data the options and permutations are endless and most people need to get to delivering solutions so decision makers can improve business results. While no solution is perfect, in my experience there are certain ingredients that when put together well enable users to get to building quickly. If you want to deliver analytics your decision makers can trust, just go Omakase.

Dbt & Airflow

Using dbt: Documenting & Testing Data from Various Tools

5 mins read

In our previous article we wrote about the various dbt tests, we talked about the importance of testing data and how dbt, a tool developed by dbt Labs, helps data practitioners validate the integrity of their data. In that article we covered the various packages in the dbt ecosystem that can be used to run a variety of tests on data. Many people have legacy ETL processes and are unable to make the move to dbt quickly, but they can still leverage the power of dbt and by doing so slowly begin the transition to this tool. In this article, I’ll discuss how you can use dbt to test and document your data even if you are not using dbt for transformation.

Why dbt?

Ideally, we can prevent erroneous data from ever reaching our decision makers and this is what dbt was created to do. dbt allows us to embed software engineering best practices into data transformation. It is the “T” in ELT (Extract, Load, and Transform) and it also helps capture documentation, testing, and lineage. Since dbt uses SQL as the transformation language, we can also add governance and collaboration via DataOps, but that’s a topic for another post.

I often talk to people who find dbt very appealing, but they have a lot of investment in existing tools like Talend, Informatica, SSIS, Python, etc. They often have gaps in their processes around documentation and data quality and while other tools exist, I believe dbt is a good alternative and by leveraging dbt to fill the gaps in your current data processes, you open the door to incrementally moving your transformations to dbt,

Eventually dbt can be fully leveraged as part of the modern data workflow to produce value from data in an agile way. The automated and flexible nature of dbt allows data experts to focus more on exploring data to find insights.

Why ELT?

The term ELT can be confusing, some people hear ELT and ETL and think they are fundamentally the same thing. This is muddied by marketers who try to appeal to potential customers by suggesting their tool can do it all. The way I define ELT is by making sure that data is loaded from the source without any filters or transformation. This is EL (Extract and Load). We keep all rows and all columns. Data is replicated even if there is no current need. While this may seem wasteful at first, it allows Analytic and Data Engineers to quickly react to business needs. Have you ever faced the need to answer a question only to find that the field you need was never imported into the data warehouse? This is common especially in traditional thinking where it was costly to store data or when companies had limited resources due to data warehouses that coupled compute with storage. Today, warehouses like Snowflake have removed this constraint so we can load all the data and keep it synchronized with the sources. Another aspect of modern EL solutions is making the process to load and synchronize data simple. Tools like Fivetran and Airbyte allow users to easily load data by simply selecting pre-build connectors for a variety of sources and selecting the destination where the data should land. Gone are the days of creating tables in target data warehouses and dealing with changes when sources add or remove columns. The new way of working is helping users set it and forget it.

Graphical user interface, text, application, email, websiteDescription automatically generated — This is an example of a modern data flow. **Data Loaders** are the tools that do the extracting and loading process to get the data to the **RAW** area of the data warehouse. These tools include Stitch, Fivetran and Airbyte. Now that the data is in the warehouse dbt can be leveraged for the transformation. As you can see above dbt delivers transformed data and also enables snapshotting, testing, documenting, and facilitates deploying.

Want more flexibility? Migrate your dbt Cloud project in under an hour.

Book a call

Plugging in dbt for testing

In an environment where other transformation tools are used, you can still leverage dbt to address gaps in testing. There are over 70 pre-built tests that can be leveraged, and custom tests can be created by just using SQL. dbt can test data anywhere in the transformation lifecycle. It can be used at the beginning of the workflow to test or verify assumptions about data sources and the best part is that these data sources or models do not need to be a part of any ongoing project within dbt. Imagine you have a raw customer table you are loading into Snowflake. We can connect this table to dbt by creating a source yml file where we tell dbt where to find the table by providing the name of the database, schema, and table. We can then add the columns to the table and while we are at it, we can add descriptions.

The image below illustrates how test would be added for a CUSTOMER table in the SNOWFLAKE_SAMPLE_DATA database in the TPCH_SF100 schema.‍

Graphical user interface, text, application, emailDescription automatically generated

‍

We can do tests at the table level. Here we check that the table has between 1 and 10 columns.

‍

A picture containing graphical user interfaceDescription automatically generated — We can also do tests at the column level. In the image above we assure that C_CUSTKEY columns has no duplicates by leveraging dbt’s **unique** test and we check that the column is always populated with the **not_null** test.

Testing non-source tables

So far we have done what you would learn on a standard dbt tutorial, you start with some source, connect it to dbt, and add some tests. But the reality is, dbt doesn’t really care if the table that we are pointing to is a true "source" table or not. To dbt, any table can be a source, even an aggregation, reporting table, or view. The process is the same. You create a yml file, specify the “source” and add tests.

Let’s say we have a table that is an aggregate for the number of customers by market segment. We can add a source that points to this table and check for the existence of specific market segments and a range of customers by segment.

Graphical user interface, textDescription automatically generated

Using this approach, we can leverage the tests available in dbt anywhere in the data transformation pipeline. We can use dbt_utils.equal_rowcount to validate that two relations have the same number of rows to assure that a transformation step does not inadvertently drop some rows.

When we are aggregating, we can also check that the resulting table has fewer rows than the table we are aggregating by using the dbt_utils.fewer_rows_than test.

TextDescription automatically generated with medium confidence

Notice that you can use the source macro when referring to another model outside of dbt. As long as you register both models as sources, you can refer to them. So when you see documentation that refers to the ref() macro, just substitute with the source macro as I did above.

Graphical user interface, text, application, chat or text messageDescription automatically generated

Also, note that even though documentation may say this is a model test, you can use this in your source: definition as I have done above.

Documenting tables

In dbt sources, we can also add documentation like so:

Text, applicationDescription automatically generated

These descriptions will then show up in the dbt docs.

Graphical user interface, application, TeamsDescription automatically generated — By only having sources in dbt docs, you will not have the lineage capability of dbt, but the above is more than many people have.

Conclusion

dbt is a great tool for transforming data, capturing documentation, and lineage, but if your company has a lot of transformation scripts using legacy tools, the migration to dbt may seem daunting and you may think you cannot leverage the benefits of dbt.

By leveraging source definitions you can take advantage of dbt’s ecosystem of tests and ability to document even if transformations are done using other tools.

Gradually the organization will realize the power of dbt and you can gradually migrate to dbt. For the data to be trusted, it needs to be documented and tested and dbt can help you in this journey.

Noel Gomez

You don’t need to build data lake, you need Omakase

Appetizer: Loading data

Main dish: Transforming data

Dessert: Reporting on data

Conclusion

Using dbt: Documenting & Testing Data from Various Tools

Why dbt?

Why ELT?

Want more flexibility? Migrate your dbt Cloud project in under an hour.

Plugging in dbt for testing

Testing non-source tables

Documenting tables

Conclusion

Get our free ebook dbt Cloud vs dbt Core