Innovation

Change management, fresh ideas and forward-thinking strategies shaping the future of data engineering and analytics.
Explaining big data analytics with baking cakes

Big Data Analytics - Making information tasty and accessible

No one bakes for the sake of baking alone; cakes are meant to be shared. If no one bought, ate, or was gifted that delicious chocolate cake, there would be no bakers. The same is true for data analytics. Our goal as data practitioners is to feed our organization the information it needs to make decisions.

If our cake doesn’t taste good or isn’t available when people want dessert, then it doesn’t matter that we made it from scratch. When it comes to big data, your goal should be to have the equivalent of a delicious cake - usable data - available when someone needs it.

A person enjoying an easily accessible dessert
Image from Pexels

Big Data Analytics - The ever-changing tastes of data consumers

Life would be simple if everyone were happy with a single flavor of cake. Metrics play a crucial role in our organization, and two of the most fundamental ones are ARR (Annual Recurring Revenue) and NRR (Net Revenue Retention). These metrics are like chocolate and vanilla - they remain popular and relevant. Yet these flavors alone are not enough. Just as we eventually want to try a new ice cream flavor, we eventually want new insights. It’s important to experiment and explore different perspectives.

When something is novel, we love it. When we first start baking, we are not consistent, but over time quality improves. We go from okay to good to great. Even with our newfound expertise, the chocolate cake eventually becomes boring. We all want something new.

With data, you often start with a simple metric. Having something is better than nothing, but that only lasts so long. We discover something new about our business and translate that information into action. This is good for a while, but we will eventually see diminishing returns. While an LTV (lifetime value) analysis may have a significant impact today, its usefulness is likely to fade quickly. Your stakeholders will crave something new and more innovative.

Someone is going to ask for a deep dive into how CAC (Customer Acquisition Cost) impacts LTV, or they might ask for a new kind of icing on that cake you just made. Either way, the point remains: as those around you start making use of (or eating) what you’re providing, they will inevitably ask for something more.

A variety of desserts displayed on a table
Image from Pexels

Big Data Analytics - How much expertise do you need to bake successfully?

It depends. Some organizations can do more than enough with the reports and dashboards built into their CRM, web analytics, or e-commerce system. This works for many companies. There is no need to spend extra time and money on more complex data systems when something simple will suffice. There is a reason that chocolate and vanilla are popular flavors; many people like their taste and know what they are getting with their order. The same can be said for your data infrastructure.

However, flexibility is the challenge. You are limited to the built-in reporting options and the analytics you can access. You can have a cake you can eat, but it’s going to look a certain way, with only one type of icing, and it certainly will not have any premium fillings.

If you want those things, you need to look for something a bit more nuanced.  

Different tools needed in baking
Image from Pexels

Big Data Analytics - Start with an Easy Bake Oven

How do you start to accommodate your organization’s new demand for analysis or your family’s newly refined palate? By using new tools and techniques.  

The Easy Bake Oven is the simplest first step – you get a pouch of ingredients, mix them, and within a few minutes, you have a cake. You might think it’s a children’s toy, but there are very creative recipes out there for the adventurous types. Unfortunately, you can only go so far; you’re limited by the size of the oven and the speed at which you can bake each item.

The corollary in the data world is Excel – a fabulous tool, but something that has its limitations. While it is possible to extract data from your tools and manipulate it in a spreadsheet to make it more manageable, you are still limited by the pre-designed extracts offered by the vendor tools. Excel is flexible enough for many, but it is not a perfect end-state reporting solution for everyone.  

Eventually, you’ll need to address inconsistencies between systems and automate the process of data prep before it gets into your spreadsheet.
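To make “automating data prep” concrete, here is a minimal sketch in SQL of the kind of reconciliation that otherwise happens by hand in a spreadsheet. The table and column names (crm_deals, shop_orders) are hypothetical; the point is that two systems describing the same thing get standardized once, upstream, instead of every time someone opens Excel.

```sql
-- A minimal sketch of automated data prep (hypothetical tables and columns):
-- two systems report revenue with different names and units, so we
-- standardize them before the data ever reaches a spreadsheet.
select
    cast(deal_id as varchar)      as order_id,
    close_date                    as order_date,
    amount_usd                    as revenue_usd
from crm_deals

union all

select
    cast(order_number as varchar) as order_id,
    cast(ordered_at as date)      as order_date,
    total_cents / 100.0           as revenue_usd
from shop_orders;
```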

An Easy Bake Oven
Image from Hasbro

Big Data Analytics - Using more advanced techniques to elevate your baking

We have a feel for the basics, but now we want to improve our process. There are certain things we do repeatedly, regardless of the recipes we’re making. We need to create an assembly line. Whether we’re calculating metrics or mixing the wet ingredients and the dry ingredients, there are steps we need to complete in a specific order and to a certain level of quality.  

We’ve moved on from the Easy Bake Oven and are now baking using a full-size oven. We need to be more careful about our measurements, ensure the oven is at the correct temperature, and write down notes about the various steps in our more complicated recipes. We need to make sure that different cakes, fillings, and decorations are ready at the same time, even if they have wildly different prep times. We need consistent results every time we make the cake, and we need others to be able to make the same recipe and achieve the same results by following our instructions.

Documenting and transferring this knowledge to others is difficult. Sure, you can write things down, but it is easy to skip a step that you take for granted. Perhaps your handwriting is hard to read, or someone is having trouble with the oven. You may be aware that using a hair dryer while baking can cause the circuit breaker to trip, but unless you document this information, others will not know this quirk. If your recipe has changed, such as using individual ingredients instead of a pre-made cake mix, it is important to clearly specify these adjustments, otherwise your friends and family may struggle to replicate your delicious cake.

There are plenty of data analytics tools out there that assist at this stage – Alteryx, Tableau Prep, and Datameer are just a few of many. In large enterprise organizations, you might find Informatica PowerCenter, Talend, or Matillion. These types of tools have graphical user interfaces (GUIs); they give you the flexibility to extract, load, inspect, and transform data. Many enable you to define and calculate metrics. But they require you to work within each tool’s set of rules and constraints. This works well if you are starting out and need something less complex.

The process that was once simple now is not; there are hidden assumptions, configurations, and requirements. You’re not using a pen-and-paper recipe anymore; now you’re working within a new system.  

GUI-based tools are great for companies whose workflows fit the way the systems work. But between the way some tools are licensed and the skill needed to use them, they are often only available to the IT organization. This leaves users to find shortcuts, develop workarounds, and become dependent on the business’s shadow IT. Inevitably, you’re going to run into maintainability issues.

Recipes often have plenty of steps and ingredients; “data recipes” are no different. There are dependencies between operations, different run times for different transformations, different release cadences, and data availability SLAs. Your team might be able to manage your entire workflow quite well with one of these tools, but once your team starts introducing custom SQL logic or additional overlapping tools into the ecosystem, you introduce another layer of complexity. The result is an increase in the total cost of ownership.  

Often, this complexity is opaque, too. It is not obvious what each GUI component is doing, yet they are strung together to build something usable. Over time, the complexity continues to grow; custom SQL logic is introduced, and more steps are added to the chain. Eventually, abstractions begin to form. The data engineers decide that these processing pipelines look quite similar and can be customized based on some basic configurations. Less overhead, more output.

You’re now on the path of building custom ELT (Extract, Load, and Transform) pipelines, stitched together within the constraints of a GUI-based system. For some companies, this is okay, and it works. But there is a hidden cost – it is harder to maintain high-quality inputs and outputs. The layers are tightly coupled, and a mistake in one step is not caught until the whole pipeline is complete.

IT may not be aware of downstream issues because they occur in other tools outside of their domain; one change here breaks something else there. This is like buying a ready-made cake mix and “enhancing” it with your custom ingredients. It works until it does not. One day, Duncan Hines changes their ingredients, and without you realizing it, there is a bad reaction between your “enhancements” and the new mix. Your once-great recipe is not so great anymore, but there was no way for Duncan Hines to know. They expected you to follow their instructions; everything had been going according to plan until now.

Even if your tool has strong version control built in, it’s often difficult to reverse a change before it is too late. If your recipe calls for 1 cup of sugar but you accidentally add 11, you don’t want to wait until the cake is baked to discover the error. You want to catch that mistake as soon as it happens.
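In data terms, catching the mistake as it happens means running an assertion immediately after a step completes rather than at the end of the pipeline. Here is a minimal sketch in SQL; the table and column names (daily_orders, order_total) are hypothetical.

```sql
-- Run this check right after the step that loads daily orders, instead of
-- waiting for the finished dashboard. Any rows returned mean the pipeline
-- should stop here. Table and column names are hypothetical.
select order_id, order_total
from daily_orders
where order_total < 0            -- impossible values
   or order_total > 1000000;     -- the "11 cups of sugar" kind of mistake
```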

A robot baking a cake
Image from Pexels

Big Data Analytics - Building consistency into the baking process

Everything up to this point serves a specific profile of a business, but what happens when the business matures beyond what these tools offer? What happens when you know how to bake a cake, but struggle to consistently produce hundreds or thousands of the same quality cakes?

When we aim to expand our baking operations, it's crucial to maintain consistency among bakers, minimize accidental mistakes, and have the ability to swiftly recover from any errors that occur. We need enough mixers and ovens to support the demand for our cakes, and we need an organized pantry with the correct measuring cups and spoons. We need to know which ingredients are running low, which are delayed, and which have common allergens.  

In data analytics, we have our own supply chain, often called ELT. Tools like Airbyte and Fivetran are common choices for our data “ingredient delivery”. They manage data extraction and ingestion so you can skip the manual CSV downloads that once served you so well.

We want to ensure quality, have traceability, document our process, and successfully produce and deliver our cakes. To do this, we need a repeatable process, with clear sequential steps. In the world of baking, we use recipes to achieve this.  

All baking recipes have a series of steps, some of which are common across different recipes. For example, creaming the eggs and sugar, combining the wet and dry ingredients, and whipping the icing are repeatable steps that, when performed in the correct order, result in a delicious dessert. Recipes also follow a standard format: the oven temperature comes first, the ingredients are listed in the order in which they are used, and the preparation steps come last. The sequence is intentional and provides a clear understanding of what to expect during the baking process.

We can apply this model to our data infrastructure by using a tool called dbt (data build tool). Instead of repeating miscellaneous transformation steps in various places, we can centralize our transformation logic into reusable components. Then we can reference those components throughout our project. We can also identify which data is stale, review the chain of dependencies between transformations, and capture the documentation alongside that logic.
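To make that concrete, here is a minimal sketch of a dbt model; the model and column names are hypothetical. The ref() calls are what let dbt build the dependency graph, run transformations in the right order, and generate lineage and documentation from the code itself.

```sql
-- models/marts/fct_orders.sql (hypothetical model)
-- config() sets how the model is materialized; ref() declares dependencies
-- on upstream models instead of hard-coding table names.
{{ config(materialized='table') }}

select
    o.order_id,
    o.customer_id,
    o.ordered_at,
    p.amount_usd
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_payments') }} as p
    on o.order_id = p.order_id
```

From there, dbt docs generate builds a browsable site with the descriptions and dependency graph, and dbt’s source freshness checks flag which “ingredients” have gone stale.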

We no longer need a GUI-driven tool to review our data; instead, we use the process as defined in the code to inform our logic and documentation. Our new teammate can now confidently create her cake to the same standard as everyone else; she can be confident that she is avoiding common allergens, too.

Better yet, we have a history of changes to our process and “data recipe”. Version control and code reviews are an expectation, so we know when modifications to our ingredient list will cause a complete change in the final product. Our recipes are no longer scattered, but part of a structured system of reusable, composable steps.  

Multiple cakes being baked
Image from Pexels

Big Data Analytics - Are you ready to bake well, consistently?

Maturing the way we build data processes comes down to our readiness. When we started with our Easy Bake Oven, little could go wrong, but little could be tweaked. As we build a more robust system, we can take advantage of its increased flexibility, but we also need to maintain more pieces and ensure quality throughout a more complex process.

We need to know the difference between baking soda and baking powder. Which utensils are best, and which oven cooks most evenly? How do we best organize our new suite of recipes? How do you set up the kitchen and install the appliances? There are many decisions to make, and not every organization is ready to make them. All this can be daunting for even large organizations.

But you don’t have to do things all at once. Many organizations choose to make gradual improvements, transforming their big data process from disorganized to consistent.

You can subscribe to SaaS services like Fivetran or Airbyte for data loading and use providers like dbt Cloud for dbt development. If your work grows increasingly complex, you can use another set of tools (such as Astronomer or Dagster) to orchestrate your end-to-end process. You will still need to develop that end-to-end flow yourself, so what you gain in flexibility, you lose in simplicity.

A professional baking a cake
Image from Pexels

Big Data Analytics - Becoming a master baker quickly

This is what we focus on at Datacoves. We aim to help organizations create mature processes, even when they have neither the time nor resources to figure everything out.

We give you “a fully stocked kitchen” - all the appliances, recipes, and best practices to make them work cohesively. You can take the guesswork out of your data infrastructure and instead use a suite of tools designed to help your team perform timely and efficient analytics.

Whether your company is early in its data analytics journey or ready to take your processes to the next level, we are here to help. If your organization has strong info-sec or compliance requirements, we can also deploy within your private cloud. Datacoves is designed to get you “baking delicious cakes” as soon as possible.

Set aside some time to speak with me and learn how Datacoves has helped both small and large companies deploy mature analytics processes from the start. Also check out our case studies to see some of our customers’ journeys.

Imagine yourself baking the next great cake at your organization. You can do it quickly with our help.

Master bakers making desserts
Image from Pexels

Comparing cooking to data solutions you can trust

In 3 Core Pillars to a Data-Driven Culture, I discussed the reasons why decision makers don’t trust analytics, and I outlined the alignment and change management aspects of any solution. Once you know what you want, how do you deliver it? The cloud revolution has brought a new set of challenges for organizations, challenges that have nothing to do with delivering solutions. The main problem is that people are faced with a Cheesecake Factory menu when most would be better served by omakase.

For those who may not be aware, The Cheesecake Factory menu has 23 pages and over 250 items to choose from. There are obviously people who want that variety, and there is certainly nothing wrong with that, but my best meals have been the ones where I left the decision to the chef.

Omakase, in a Japanese restaurant, is a meal consisting of dishes selected by the chef; it literally means “I’ll leave it up to you.”

Omakase leaves the decision to the chef

How does this relate to the analytics landscape? Well, there is a gold rush in the analytics space. There is a lot of investment, and there are literally hundreds of tools to choose from. I have been following this development over the last five years and, if anything, the pace of new tool introductions has accelerated.

This eye chart represents the ever growing list of analytics tools

Most people are where I was back in 2016. While I had been working in this space for many years, the cloud and big data world was all new to me. There was a lot I needed to learn, and I was always questioning whether I was making the right decision. I know many people today who run POC after POC to see which tool will work best; I know, because I did the same thing.

Contrast this process with my experience learning a web development framework called Ruby on Rails. When I started learning Rails in 2009, I was focused on what I was trying to build, not the set of tools and libraries needed to create a modern web application. That’s because Rails is omakase.

When you choose omakase with Rails, you are trusting many people with years of experience and training to share that knowledge with you. Not only does this help you get going faster, but it also brings you into a community of like-minded people, so when you run into problems, there are people ready to help. Below I present my opinionated view of a three-course-meal data stack that can serve most people, and the rationale behind it. This solution may not be perfect for everyone, but neither is Rails.

Appetizer: Loading data

You are hungry to get going and start doing analysis, but we need to start off slowly. You want to get the data, but where do you start? Well, there are a few things to consider.

- Where is the data coming from?
- Is it structured into columns and rows, or is it semi-structured (JSON)?
- Is it coming in at high velocity?
- How much data are you expecting?
What I find is that many people want to over-engineer a solution or optimize for a single dimension, which is usually cost, since that is the simplest to grasp. The problem is that if you focus only on cost, you are giving up something else, usually a better user experience. You don’t have a lot of time to evaluate solutions and build extract and load scripts, so let me make this simple: if you start with Snowflake as your database and Fivetran as your extract-and-load solution, you’ll be fine. Yes, there are reasons not to choose those solutions, but you probably don’t need to worry about them, especially if you are starting out and you are not Apple.

Why Snowflake, you ask? Well, I have used Redshift, MS SQL Server, Databricks, Hadoop, Teradata, and others, but when I started using Snowflake I felt like a weight was lifted. It “just worked.” Do you think you will need to mask some data at some point? They have dynamic data masking. Do you want to scale compute and storage independently? Compute and storage are separate. Do you like waiting for data vendors to extract data from their system and then having to import it on your side? Or do you need to collaborate with partners and send them data? Well, Snowflake has a way for companies to share data securely; gone are the days of moving data around. Now you can securely grant access to groups within or outside your organization. Simple, elegant.

What about enriching your data with external data sources? They have a data marketplace too, and it is bound to grow. Security is well thought out, and you can tell they are focused on the user experience because they do things to improve analyst happiness, like MATCH_RECOGNIZE. Oh, and they also handle structured and semi-structured data amazingly well, all without having to tweak endless knobs. With one solution, I have been able to eliminate the need to answer the questions above, because Snowflake can very likely handle your use case regardless of the answers. I can go on and on, but trust me, you’ll be satisfied with your Snowflake appetizer. If it’s good enough for Warren Buffett, it’s good enough for me.
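As one small example of “no endless knobs”: Snowflake’s VARIANT type lets you query semi-structured JSON directly, with no pre-processing pipeline. A minimal sketch, where the raw_events table and its payload fields are hypothetical:

```sql
-- Query raw JSON as if it were columns, and flatten an embedded array of
-- line items. The table (raw_events) and payload fields are hypothetical.
select
    payload:customer.id::string         as customer_id,
    payload:order.total::number(12, 2)  as order_total,
    item.value:sku::string              as line_item_sku
from raw_events,
     lateral flatten(input => payload:order.items) item;
```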

But what about Fivetran, you say? Well, you have better things to do than replicate data from Google Analytics, Salesforce, Square, Concur, Workday, Google Ads, and so on. Here’s the full list of current connectors Fivetran supports. Just set it and forget it. No one will give you a medal for mapping data from standard data sources to Snowflake. So do the simple thing and let’s get to the main dish.

Finish your data appetizer and get to the main dish.

Main dish: Transforming data

Now that we have all our data sources in Snowflake, what do we do? Well, I haven’t met anyone who doesn’t want some level of data quality, documentation, and lineage for impact analysis, done in a collaborative way that builds trust in the process.

I’ve got you covered. Just use dbt. Yup, that’s it: a single, simple tool that can do documentation, lineage, data quality, and more. dbt is a key component in our DataOps process because, like Snowflake, it just works. It was developed by people who were analysts themselves and appreciated software development best practices like DRY (Don’t Repeat Yourself). They knew that SQL is the great common denominator and all it needed was some tooling around it. It’s hard enough finding good analytics engineers, let alone ones who know Python. Leave the Python to data science and first build a solid foundation for your transformation process. Don’t worry, I didn’t forget about your ambition to create great machine learning models; Snowflake has you covered there as well. Check out Snowpark.
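As a small illustration of the DRY point, here is the kind of Jinja macro dbt encourages: define a calculation once and reuse it in any model that needs it, instead of copy-pasting the SQL. The macro name and columns are hypothetical.

```sql
-- macros/cents_to_dollars.sql (hypothetical macro)
-- Defined once, reused wherever the conversion is needed.
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::number(16, 2)
{% endmacro %}

-- Used inside a model:
-- select {{ cents_to_dollars('amount_cents') }} as amount_usd
-- from {{ ref('stg_payments') }}
```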

You will need a little more than dbt to schedule your runs and bring order to what would otherwise become chaos, but dbt will get you a long way there. If you want to know how we solve this with Datacoves, reach out; we’ll share our knowledge in our 1-hour free consultation.

A great meal starts with great ingredients
A great meal starts with great ingredients.

Dessert: Reporting on data

This three-course meal is quickly coming to an end, but I couldn’t let you go home before you have dessert. If you need dashboards but also want self-service, you can’t go wrong with Looker. I am not the only chef saying this; have a look at this.

One big reason for choosing Looker, in addition to the above, is that version control is part of the process. If you want things that are documented, reusable, and built with software development best practices, then you need to have everything in version control. You can no longer depend on the secret recipe that one of your colleagues keeps on their laptop. People get promoted, move to other companies, forget… and you need a data stack that is not brittle. So choose your dessert wisely.

Finish a great meal with dessert

Conclusion 

There are a lot of decisions to be made when creating a great meal. You need to know your guests’ dietary needs, what you have available, and how to turn raw ingredients into a delicious plate. When it comes to data, the options and permutations are endless, and most people need to get on with delivering solutions so decision makers can improve business results. While no solution is perfect, in my experience there are certain ingredients that, when put together well, enable teams to get building quickly. If you want to deliver analytics your decision makers can trust, just go omakase.
