5 open source data quality tools

Data is in the spotlight as companies everywhere realize data's true potential. With big initiatives like GenAI and sophisticated data ecosystems, ensuring data quality is not just a necessity but a mandatory investment for businesses and analysts worldwide. Some people are learning the hard way that you need stable data foundations to get the results these initiatives promise.

While there are many great tools out there, the spotlight on open source tools has never been brighter. Open source software offers transparency, adaptability, and community-driven enhancements that are crucial in the rapidly evolving data landscape. This article covers 5 open source data quality tools and is current as of April 2024, so if that is something that interests you, stick around.

What is data quality?

First things first, what is data quality? There are many definitions, but data is generally considered high quality if it is fit for its intended uses in operations, decision-making, and planning. In other words, data quality refers to the data's accuracy, completeness, reliability, relevance, and how up-to-date it is. In the context of data-driven decision-making, high-quality data is crucial because it directly affects the accuracy of insights and the effectiveness of decisions. It is our data foundation.

High-quality data should be:

Accurate: Data that is free from errors and discrepancies.

Complete: Data that covers the necessary breadth and depth needed by the business.

Reliable: Data that has no missing elements and is consistently represented and sourced.

Relevant: Data that is applicable to the context and purposes for the business.

Current: Data that is up-to-date and timely for its purpose.

Understanding what is needed for data quality is the first step toward recognizing the importance of these tools and practices that maintain or enhance this quality.

What should you track for good data quality?

Now we know what constitutes high-quality data, but what do we need to monitor to ensure that our data stays that way? The good news is that these metrics tend to be universal. To maintain high data quality, monitor the following regularly:

Accuracy: Ensure that your data correctly represents reality or the source from which it came.

Completeness: Check for missing values or data segments that could lead to incorrect analysis or conclusions.

Consistency: Data across different systems or platforms should match and be consistent.

Timeliness: Data should be updated and available in a timeframe that aligns with its intended use.

Validity: Data should adhere to the relevant rules, such as data formats and value ranges.

Uniqueness: No duplicates should be present unless necessary, ensuring each entry is unique.

Integrity: There should be a relationship between datasets and records that maintains data accuracy and consistency.
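
To make a few of these metrics concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the sample rows, the field names, and the helper functions (completeness, uniqueness, validity, timeliness) are hypothetical and do not come from any of the tools discussed in this article.

```python
from datetime import date

# Hypothetical sample records; field names are illustrative only.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34, "updated": date(2024, 4, 1)},
    {"id": 2, "email": None, "age": 29, "updated": date(2024, 4, 2)},
    {"id": 2, "email": "c@example.com", "age": -5, "updated": date(2023, 1, 15)},
]


def completeness(rows, column):
    """Share of rows where the column is not missing."""
    return sum(r[column] is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among all values in the column."""
    values = [r[column] for r in rows]
    return len(set(values)) / len(values)


def validity(rows, column, rule):
    """Share of non-missing values that satisfy a rule."""
    values = [r[column] for r in rows if r[column] is not None]
    return sum(rule(v) for v in values) / len(values)


def timeliness(rows, column, as_of, max_age_days):
    """Share of rows updated within max_age_days of the as_of date."""
    return sum((as_of - r[column]).days <= max_age_days for r in rows) / len(rows)


report = {
    "completeness(email)": completeness(rows, "email"),
    "uniqueness(id)": uniqueness(rows, "id"),
    "validity(age)": validity(rows, "age", lambda v: 0 <= v <= 120),
    "timeliness(updated)": timeliness(rows, "updated", date(2024, 4, 3), 30),
}
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Each score is a ratio between 0 and 1, so the same numbers can be tracked over time or compared against a threshold that fits your use case.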

By tracking these metrics, organizations can set up the essential data foundation and significantly improve the trustworthiness and utility of their data. This will lead to better outcomes and insights that can support great data initiatives of the future.  

Since it is essential to track these metrics, companies are on the search for the best tool to help them improve their data quality. Here is a list of open source tools that can be leveraged to improve data quality.

Why open source?

Before we jump into the tool list, you may have noticed that a quick Google search for this topic will give you many different lists. How is our list different? We are focusing on open source tools. There are many great tools out there, both paid and “free”. We put quotes around free because there is no such thing as free; there are always hidden costs (hours worked) for setup and maintenance. However, we wanted to make this open source tool list because, regardless of the hidden costs, we believe in the following benefits of open source tools:

Transparency: Open source tools offer complete transparency in their operations and algorithms. Users can inspect, modify, and improve the code, which enhances trust and reliability.

Community: Open source projects benefit from the collective intelligence of a global community. This not only accelerates innovation and bug resolution but also provides a large pool of knowledge and support.

Flexibility: With open source, organizations are not locked into proprietary systems, allowing them to tailor tools to their specific needs and integrate them seamlessly into their existing environments.

Cost-effectiveness: While open source doesn't always mean free, it significantly reduces costs associated with licensing fees and vendor lock-in, making cutting-edge tools accessible to everyone.

Quality and Security: Continuous contributions and scrutiny by the community mean that open source tools often meet high standards of quality and security, with issues being identified and addressed rapidly.

Our selection of open source data quality tools is grounded in rigorous open source criteria. We believe that the strength of an open source project lies not just in its ability to solve complex problems but also in its community, transparency, and commitment to ongoing improvement. When compiling this list, we considered factors such as active community engagement, frequency of updates, the quality of the documentation, and ease of contribution. This ensures that the tools recommended not only meet high standards of performance and reliability but also embody the principles that make open source software a valuable asset to the data quality landscape. So without further ado, let's jump into our list.

Open source data quality tool list

dbt Core

Self managed or in Datacoves

Primary Language: SQL / YAML

Purpose & Features: dbt Core is an open source tool that allows data analysts and engineers to transform data in their data warehouses by writing dynamic SQL queries, which dbt then converts into tables and views. It also supports version control, testing, and documentation, which helps maintain data integrity and reliability.

For data quality, dbt Core has some out-of-the-box data tests, which can be extended with custom tests or with packages such as dbt-expectations and elementary. Testing is done by configuring generic tests in YAML files or by writing custom SQL tests. However, integrating dbt Core into your data stack can be a big task, especially when it comes to scheduling. A managed dbt Core platform such as Datacoves could be a great option for saving time and money. While dbt handles only the 'T' in ELT, Datacoves’ managed dbt Core platform ensures that the entire ELT process is smooth and interconnected, allowing your team to concentrate on deriving insights from the data. There are other dbt alternatives on the market that also handle the 'T' in the ELT process and can be explored.

Who it is for: Best for teams using SQL who want to transform data directly in the warehouse and who want to follow software development best practices including unit testing in their data pipelines.
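
The idea behind this style of testing can be sketched without dbt itself: a data test is a query that selects the records violating an assumption, and the test passes when the query returns no rows. The snippet below is a rough illustration using Python's built-in sqlite3 module. The orders table, its columns, and the test names are hypothetical, and this is not how dbt is invoked in practice.

```python
import sqlite3

# In-memory database standing in for a warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 11), (2, 12), (3, NULL);
    """
)

# Each test is a query that selects the rows violating an assumption.
tests = {
    "unique_order_id": """
        SELECT order_id FROM orders
        GROUP BY order_id HAVING COUNT(*) > 1
    """,
    "not_null_customer_id": """
        SELECT order_id FROM orders WHERE customer_id IS NULL
    """,
}

results = {}
for name, sql in tests.items():
    failing_rows = conn.execute(sql).fetchall()
    results[name] = failing_rows
    # A test passes only when no failing rows come back.
    status = "PASS" if not failing_rows else f"FAIL ({len(failing_rows)} rows)"
    print(f"{name}: {status}")
```

Here both tests fail on purpose: order_id 2 appears twice and one order has no customer_id.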

Image: dbt test run in the Datacoves UI

Soda Core

Primary Language: YAML

Purpose & Features: Soda Core is the open source component that allows users to define data quality checks in code and integrate them into workflows.  

Who it's for: Teams that need data quality checks integrated into their existing Python workflows or data pipelines.

Image: Soda Core CLI

Great Expectations

Primary Language: Python  

Purpose & Features: This tool is a data quality platform that allows you to create data tests, documentation, and profiles automatically. It integrates into existing data processing pipelines to validate data against expectations (unit tests). You can collaborate with non-technical stakeholders by sharing the Data Docs. Data Docs are Expectations, Validation Results, and other metadata translated into a human-readable format, as seen in the image below.

Who it's for: Data teams looking for a Pythonic way to enforce data quality rules and create automated data documentation.

Image: Great Expectations Data Docs

Deequ

Primary Language: Scala (for Apache Spark)

Purpose & Features: Deequ is an open source tool by Amazon with which you can define "unit tests" (columnar or row level) for large-scale data within the Spark ecosystem. It allows for automated checks of data quality metrics such as completeness, uniqueness, and conformity. This enables data teams to find errors early, before the data is consumed downstream. You can use Deequ to define your assumptions about the data in unit tests and catch any data that does not meet them. This tool works on tabular data such as CSV files, database tables, logs, and flattened JSON files.

Who it's for: Data engineers and scientists working with big data in Spark (billions of rows), particularly those focused on maintaining data quality at scale.

Image: Deequ unit tests

Data Diff

Primary Language: Python (CLI). You don’t write data quality tests by hand; as you make changes to your SQL, Data Diff compares the resulting tables for you.

Purpose & Features: This tool is a little different from the rest because you're not exactly writing tests to catch data quality issues. Instead, this open source Python package by Datafold lets you do development testing by spotting the differences between tables whenever you tweak your code. It's a great way to compare what's happening in your production data against your development changes, helping you see directly how those code changes are playing out in the data.  
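
The underlying idea, comparing a production table against a development table by primary key, can be sketched in a few lines of Python. This is an in-memory illustration only and does not use Data Diff's API or its actual algorithm; the diff_tables helper and the sample rows are hypothetical.

```python
def diff_tables(prod, dev, key):
    """Compare two lists of row dicts by primary key."""
    prod_by_key = {row[key]: row for row in prod}
    dev_by_key = {row[key]: row for row in dev}

    # Keys present on only one side are removed or added rows.
    removed = sorted(prod_by_key.keys() - dev_by_key.keys())
    added = sorted(dev_by_key.keys() - prod_by_key.keys())
    # Keys present on both sides with different values are changed rows.
    changed = sorted(
        k for k in prod_by_key.keys() & dev_by_key.keys()
        if prod_by_key[k] != dev_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical production and development versions of the same table.
prod = [
    {"id": 1, "total": 100},
    {"id": 2, "total": 250},
    {"id": 3, "total": 75},
]
dev = [
    {"id": 1, "total": 100},
    {"id": 2, "total": 255},
    {"id": 4, "total": 40},
]

summary = diff_tables(prod, dev, key="id")
print(summary)
```

A summary like this shows at a glance which rows a code change added, dropped, or altered before the change reaches production.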

Who it's for: Data engineers and teams who need to ensure that changes in data processing and ETL logic do not negatively affect data quality.

Image: Data Diff CLI

Choose the right open source data quality tool for your use case

The concept of "the best" for data quality tools is inherently tied to specific use cases. What might be an ideal solution for one organization could be less effective for another, depending on the unique challenges and requirements each face.

Before you dive into a tool, it's crucial to understand your organization's specific data quality challenges. Are you dealing with high volumes of data, requiring scalability? Or are your main issues related to data consistency and accuracy in a smaller, more controlled dataset? Identifying your primary use case will help you navigate the five tools above and select the one that best fits your situation.

Here are 6 tips to help you make an informed choice for your specific needs:

1. Assess Your Data Quality Needs:

Identify the primary issues you're facing with your data. Are you struggling with incomplete data, inconsistencies, outdated information, or data that's not in the right format? Understanding your main challenges will guide you toward a tool that specializes in addressing those specific problems. Once you understand your data quality challenges and objectives, match these with the strengths of the tools listed above.

2. Consider Your Technical Environment:

Evaluate the technical stack you are currently using. Some data quality tools are better suited for certain environments or integrate more seamlessly with specific databases, data lakes, or processing frameworks. Choose a tool that aligns with your existing infrastructure to reduce integration headaches.

3. Evaluate Community and Support:

The strength of an open-source tool lies in its community. Look for a tool with an active community, which is evident through regular updates, vibrant forums, and extensive documentation. A strong community can provide invaluable support, from troubleshooting to best practices.

4. Check for Flexibility and Scalability:

Your data needs will evolve, so it’s important to choose a tool that is flexible and can scale with your business. Assess the tool’s ability to handle different data volumes, types, and sources. A good open-source tool should not only solve your current data quality issues but also adapt to future challenges.

5. Review Security and Compliance Features:

Data security and compliance are imperative. Be sure the tool complies with data protection regulations and offers security features to protect your data. This is especially important if you're dealing with sensitive or personal information.

6. Test Drive the Tool:

Finally, don’t hesitate to get your hands dirty. Most open-source tools are free to use, so take advantage of this by testing the tool with your data. This will give you a clear idea of the tool’s usability, effectiveness, and fit with your use case. Be sure to go into this with an open mind to get the most out of the tool.

Conclusion

In the era of generative AI and other lofty initiatives, high-quality data is not just an option but a necessity, and embracing these open-source data quality tools can significantly enhance the reliability and accuracy of your data. Remember, the "best" tool is one that aligns closely with your specific use case, offering the features and flexibility your team needs to effectively tackle your data quality challenges; it could very well be a combination of these tools. Whether you work in SQL, Python, or another programming language, there is a tool tailored to your needs. Consider factors such as ease of integration into your current data ecosystem, the learning curve for your team, and the level of community support available.

10 best data transformation tools for a smoother ETL and ELT process

Data transformation tools turn raw data into reliable, analytics-ready datasets, but choosing the right one requires understanding how transformation fits into your full data pipeline. Modern teams do not just transform data. They also orchestrate jobs, enforce quality, manage deployments, and ensure reliability at scale.

This guide explains what data transformation tools actually do, how popular options compare, and how to evaluate them as part of an end-to-end data stack rather than in isolation.

Evaluating data transformation tools requires looking beyond features and understanding how each tool fits into a production data pipeline.

Data transformation in the end-to-end process

Data transformation is the process of converting data from one format or structure to another. It improves the performance of data processing systems and compliance with data governance regulations.  

Data transformation is just one of the steps on the road to deriving value from data.

The end-to-end process includes the following steps:

  • Data Extraction: To extract data from sources such as databases and APIs 
  • Data Loading:  To load data into a desired destination such as a Data Warehouse.
  • Data Transformation: To cleanse and transform data into usable insights based on business needs. 
  • Data Orchestration: To schedule, automate, and coordinate your ELT pipeline end to end. Learn why orchestration is essential for modern data teams.
  • Data Delivery: To visualize and support decision-making 
  • Data Observability: To view and get alerted when data issues occur

It’s worth taking each of these steps into consideration when determining the best data transformation tool for your organization.

There is a common misconception that the tool alone will solve all the problems. 

However, using the right tools without addressing the underlying processes can lead to a data mess that can exacerbate the underlying issue, costing more time and money. This data mess could easily be avoided in the first place, not just by having the right tools but by also having the modern best practices in place.

What is the difference between ETL and ELT? 

Both help businesses extract, load, and transform data, but the sequence of steps differs, and each approach has its own pros and cons.

  • ETL Process: The traditional approach to data transformation. Data is Extracted and Transformed before it gets Loaded. 
  • ELT Process: The modern approach to data transformation. Data is Extracted and Loaded before it gets Transformed. 

ELT is generally more effective than ETL processes because it removes the uncertainty of not having the necessary data for future use cases and offers more flexibility in the long term. Since storage is typically affordable, it makes more sense to simplify the ingestion process.
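
A minimal sketch of the ELT sequence, assuming an in-memory SQLite database as a stand-in for the warehouse: the raw records are loaded untouched, and the cleanup happens afterwards with SQL inside the database. The table names, column names, and sample events are hypothetical.

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system.
raw_events = [
    ("2024-04-01", "signup", " Alice "),
    ("2024-04-01", "purchase", "alice"),
    ("2024-04-02", "signup", "BOB"),
]

warehouse = sqlite3.connect(":memory:")

# Extract + Load: land the data untouched in a raw table.
warehouse.execute(
    "CREATE TABLE raw_events (event_date TEXT, event_type TEXT, user_name TEXT)"
)
warehouse.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform: clean and aggregate inside the warehouse with SQL.
warehouse.execute(
    """
    CREATE TABLE events_per_user AS
    SELECT LOWER(TRIM(user_name)) AS user_name, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY LOWER(TRIM(user_name))
    """
)

events_per_user = warehouse.execute(
    "SELECT user_name, event_count FROM events_per_user ORDER BY user_name"
).fetchall()
raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(events_per_user)
print(raw_count)
```

Because the raw table is kept as loaded, a future use case can apply a different transformation to the same data without re-extracting it from the source.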

10 best data transformation tools

Here’s a list of the top data transformation tools to manage the ETL process: 

  • Datacoves
  • dbt Cloud
  • Apache Airflow
  • SAS
  • SQLMesh
  • Informatica
  • Talend
  • Azure Data Factory
  • Matillion
  • Alteryx

Each of these tools falls into one of two categories: code-based or visual/drag-and-drop interface. Both have their own set of pros and cons, which we’ll go through below. 

Code-based tools for data transformation 

Code-based tools allow you to transform data by using SQL or Python to explicitly define the transformation steps. This requires knowledge and experience, but visual tools don’t remove the need to know SQL either. The code-based approach gives users a high degree of flexibility and control, and makes it simpler to maintain and validate work before releasing it to production.

Moreover, it is simpler to trace each data transformation step without having a disconnected document explaining what the transformation “should” do.

1. Datacoves

After having multiple conversations with data teams at enterprise companies, the challenge of developing and orchestrating dbt pipelines is a topic that has come up on numerous occasions. 

There are a lot of tools to figure out when it comes to implementing the best practices for digital transformations and custom applications. It’s not uncommon for companies to end up with more SaaS platforms and tools than they had initially planned. We built Datacoves to eliminate this need by providing the following:

  • Managed dbt Core
  • Open-source technologies, meaning there is no vendor lock-in
  • Managed SaaS or private cloud deployment

Datacoves focuses on helping companies accelerate growth by providing a complete ELT solution, including orchestration and visualization. Therefore, the learning curve for data transformation is minimized because of our best-practice accelerators and the available tool integrations to form an end-to-end platform. 

Top features

  • Managed dbt Core: Get full access to dbt through Datacoves, where we provide a structured process for developing dbt pipelines. Configure a dbt environment where you can write data transformations in modular code so it’s easier to test and maintain. 
  • Hosted dbt Documentation: Simplified dbt docs deployment
  • Managed Airflow: Orchestrate data using Airflow, which is an integration that’s available to give companies the flexibility they need for an end-to-end process from data load to activation.
  • VS Code in Browser: No software installation is required, allowing you to write and edit code from anywhere as long as you have access to a web browser.
  • Deployable in Private Cloud: Accelerate data transformation and minimize ownership costs while complying with corporate and governmental regulations. 
  • Internal Tool Integrations: Integrate with internal tools like Active Directory, Bitbucket, Jenkins, GitLab, and more. 
  • CI/CD: Deliver high-quality data to users faster with more reliability and efficiency. 

How data transformation works with Datacoves

Here is the extended version of the ELT process with Datacoves: 

  • Extract data
  • Load data
  • Transform data
  • Observe
  • Orchestrate
  • Analyze 

Develop modular code and track version changes that you and your team can view. You’re also able to validate the quality of data transformations with our built-in testing frameworks and generate documents to leave a record of how you’re transforming data.

Image: Datacoves Platform Dashboard

You develop in a VS Code environment that can be configured with a vast array of VS Code extensions and Python libraries. All the modern data tools you need are provided in a structured workspace:

  • GitHub workflows
  • Automation scripts
  • Loading
  • Scheduling
  • Data security
  • Data transformation 

Image: Datacoves VS Code environment

Is Datacoves right for you?

It’s suitable for medium and large companies that lack the expertise or don’t want to create and manage complex data processes and need the flexibility that complex enterprise processes require. 

Data teams can use all the components provided within the dbt ecosystem in a structured, methodical way with Datacoves. This means you’ll have a simplified dbt experience, yet you’ll still see the same results of dbt when used to its full potential. 

Smaller companies also gain competitive advantages with Datacoves because they’ll be able to implement DataOps, follow best practices, and get a fully managed VS Code environment accelerating time to value. 

If you would like to know more about how Datacoves can help, you can schedule a demo here

2. dbt Cloud

dbt Cloud allows businesses to build and maintain data pipelines. It’s a cloud-based platform with a web-based IDE that allows you to transform data within a cloud data warehouse. It can help you reduce the time spent setting up an end-to-end solution.

Notable features 

  • Modular Code: Write data transformations in modular code. 
  • Version Control: Monitor changes to your code and seamlessly collaborate with team members. 
  • Data Testing: Built-in testing framework for validating the quality of data transformations. 
  • Documentation: Generate documents for data transformations so you know how your data is being transformed. 
  • CI/CD: Streamline the data transformation process. 

Is dbt Cloud right for you? 

dbt Cloud works well for organizations looking to reduce the time and effort required to transform data pipelines. 

Since dbt Cloud is a web-based IDE, it may feel limited for data teams that would rather use a VS Code environment. Moreover, dbt Cloud is not deployable in a company’s private cloud. It also typically requires other SaaS tools for complicated data pipelines, making it more difficult to manage unless you have the necessary integration experience with each of those SaaS tools.

Most importantly, dbt Cloud is focused solely on the data transformation step of the ELT process. Hence, you are unable to load VS Code extensions or additional Python libraries. An enterprise with any level of complexity will also need a full-featured orchestrator.

3. Apache Airflow

Apache Airflow is an open-source platform for workflow management. You can orchestrate and schedule data pipelines. It’s a scalable and flexible platform that’s based on Python. You can also define your own operators with Airflow. 

Notable features 

  • ELT pipelines: Airflow is a tool for organizing full ELT pipelines. But, it still requires your expertise to build them using Python. 
  • CI/CD tools: You can use CI/CD tools to assist with deployment. 
  • Data Extraction and Load: With Airflow, you can create Python scripts that extract and load information from other data sources. 
  • Python Libraries: Airflow can be paired with Python libraries like Pandas to shape and aggregate data.
  • Machine learning: Schedule the training of machine learning models. 
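
What an orchestrator does at its core, running tasks in dependency order, can be sketched with Python's standard library alone. This is not Airflow's API: the task functions and the dependency map below are hypothetical stand-ins, and a real orchestrator adds scheduling, retries, logging, and parallelism on top of this idea.

```python
from graphlib import TopologicalSorter

run_log = []


# Hypothetical pipeline steps; each one only records that it ran.
def extract():
    run_log.append("extract")


def load():
    run_log.append("load")


def transform():
    run_log.append("transform")


def check_data():
    run_log.append("check")


tasks = {
    "extract": extract,
    "load": load,
    "transform": transform,
    "check": check_data,
}

# Map each task to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "check": {"transform"},
}

# static_order() yields every task after all of its prerequisites.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()

print(run_log)
```

Declaring dependencies separately from the task code is the design choice that lets an orchestrator decide what can run, what must wait, and what to rerun after a failure.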

Is Apache Airflow right for you?

Apache Airflow works well for those needing a scalable data transformation tool with an open-source platform. It’s particularly a good choice for businesses mainly using Python to manage their data. 

However, Airflow is primarily an orchestrator. That means you may end up building complex code in your data pipelines. Therefore, developing and maintaining this complexity requires experience and technical expertise. Managing the infrastructure for Airflow is not trivial and also requires an understanding of tools like Docker and Kubernetes.

4. SAS

SAS is a solution that allows you to transform and prepare data for analysis. It offers a wide range of features for data transformation, including data cleaning, data integration, and data mining.

Notable features

  • Data Cleaning: Remove duplicate records, correct errors, complete missing values, and so forth. 
  • Data Integration: Combine data from different departments or systems into a single dataset. 
  • Data Mining: Identify patterns and trends in your data. 
  • Data Visualization: Create charts and graphs to visualize data. 

Is SAS right for you?

SAS is ideal for companies with complex datasets, such as those in financial services, healthcare, and retail industries. Additionally, it’s ideal for professionals with advanced skills and knowledge in data transformation. 

With that in mind, there are better solutions than SAS for those less experienced in programming and data management. SAS licensing can also be quite expensive.

5. SQLMesh

SQLMesh is a complete DataOps solution for data testing and transformation. Teams can use SQLMesh to collaborate on data pipelines when transforming data. 

Notable features

  • Semantic Understanding: SQLMesh can understand the SQL written so you can write code efficiently and avoid errors.
  • Simplified CI/CD: SQLMesh can identify the changes made to data pipelines and apply only the necessary updates to each environment.
  • Column-level Lineage: Get a better understanding of the relationships between your data and the transformation process.
  • Transpilation: Run your SQL on multiple engines so that it’s easier to migrate data into a new platform.

Is SQLMesh right for you?

SQLMesh is well-suited for businesses with SQL and Python expertise that need to collaborate on complex data transformations and pipelines. Although other open-source tools are available, teams can use SQLMesh to maintain data quality and perform unit testing of their transformations.

SQLMesh may not be ideal when you only need to perform simple data transformations. In this case, there are other more straightforward tools available. Moreover, SQLMesh may not be for you when your primary focus is on real-time data processing.

Visual ELT tools for data transformation

Visual tools make the ELT process more straightforward by removing the need to manually write code. They work by letting you drag and drop pre-built components onto a canvas. This makes them ideal for data teams who aren’t as experienced in programming.

The biggest advantage of graphical tools for ETL is that people who are less comfortable with code can use them. Conversely, drag-and-drop tools typically don’t offer the same level of flexibility and control as code-based tools, which can complicate the process of debugging data pipelines and long-term maintenance.

6. Informatica

Informatica helps you turn your data into an asset. It’s a cloud-based or on-prem solution for data management with numerous data transformation libraries and APIs available. 

Notable features 

  • PowerCenter: Enterprises can use this to manage large and complex data pipelines. 
  • Cloud Data Integration: Cloud-based integrations that allow you to move data. 
  • Data Engineering Integration: A solution designed to assist data engineers with code development, version control, and CI/CD.
  • Data Engineering Streaming: Manage streaming data pipelines with data ingestion, processing, and visualization. 

Is Informatica right for you?

Informatica can be a good choice for large enterprises and data professionals looking to quickly transform large volumes of complex data using an on-premise solution. It can also be a good choice for companies that need to comply with industry-specific data standards. 

However, it may be too complicated to use for some organizations. Informatica requires a team of experienced data engineers with the necessary skills and experience. DataOps can also be a challenge. Since you’ll be dealing with multiple things simultaneously, it’s easy to get lost in the process when you don’t have the full technical expertise.

Moreover, it’s an expensive solution. There are other more affordable alternatives.

7. Talend

Talend is a cloud-native platform deployable on public cloud solutions such as AWS, Azure, and GCP. They also offer an on-prem solution and provide a variety of components and custom connectors for data transformation. 

Notable features 

  • Talend Open Studio: An open-source data integration tool for smaller workloads. 
  • Talend Data Fabric: Manage the data integration process. Maintain data quality and governance. 
  • Cloud Data Integration: Cloud-based data integration service with a graphical user-friendly interface for creating and managing data transformation tasks. 
  • Built-in Data Catalog: Discover new data assets across your organization. 

Is Talend right for you?

Talend works for most businesses and data professionals. It’s particularly well-suited for those who need to:

  • Transform data from a variety of sources.
  • Migrate data to a new system. 
  • Build and maintain a data warehouse. 
  • Check and resolve data quality issues 

Still, you may want to consider other options when prioritizing DataOps and performing highly specialized data transformations such as machine learning or NLP. Talend enterprise licenses may also be costly.

8. Azure Data Factory

Azure Data Factory helps you simplify the data transformation process at scale. You’re provided with a code-free and code-centric experience for orchestrating data transformation pipelines. 

Notable features 

  • Built-in connectors: A variety of built-in connectors for popular data sources are available. 
  • Data Orchestration: Schedule your data pipelines.
  • Built-in Components: Use built-in components to reshape data.

Is Azure Data Factory right for you?

Azure Data Factory could be the right option for data professionals working within the Azure ecosystem. Azure may be worth considering when you’re looking into data warehousing using Azure Synapse and Azure DataOps and not just ELT. 

However, Azure Data Factory might not be the best option when you’re on a budget. As with any visual ELT tooling, DataOps and pipeline maintainability may be more complex leading to an increased total cost of ownership.

9. Matillion

Matillion is a cloud-based data transformation tool that provides integrations with on-premises databases, cloud applications, and SaaS platforms.

Notable features

  • Cloud-native architecture: Matillion runs in the cloud and allows you to push down transformation logic to leverage the scalability and performance of cloud data platforms such as Amazon Redshift and Snowflake. 
  • Visual Interface: Create data pipelines using a graphical interface, reducing the need to write code. 
  • Library of Connectors: Access a library of pre-built transformations and connectors across a range of data sources. 
  • High-Code and No-Code: Supports both high-code and no-code development, making it accessible for beginner and intermediate users. 
  • dbt Component: With the dbt component, you can embed dbt within a Matillion pipeline.

Is Matillion right for you?

Matillion’s pre-built connectors and visual interface make it an ideal solution for less experienced data professionals. The disadvantage is that it can be costly for businesses on a budget. Moreover, you must ensure that Matillion supports your specific requirements and how you intend to perform the data transformations. Care must be given to the long-term maintainability of pipelines that are both visual and code-based.

Getting started with Matillion is simple because they use a drag-and-drop interface for building data pipelines. But like with any other visual tool, there is still a learning curve and it’s typical to have a mix of code and visual components in a production data pipeline.

10. Alteryx

Alteryx simplifies the data transformation process. You can automate advanced analytics and prepare data through self-service. It’s an effective solution that makes it easier for teams to collaborate. Unlike the other visual tools above which are typically used by Data Engineers in IT, Alteryx is more widely adopted in less technical departments of an organization. It’s also typically paired with visualization tools like Tableau. 

Notable features 

  • Drag-and-drop User Interface: The visual interface makes building and collaborating on data transformation workflows easier.
  • Data Loading: Connectors to popular databases and services allow you to integrate different data sources.
  • Machine Learning: Alteryx may also be used to create simple machine learning models in a visual way.

Is Alteryx right for you?

Alteryx is a good option to help ensure teams are on the same page throughout the data workflow. Data transformation projects can be shared and feedback provided seamlessly, making collaboration easier. 

The downside is that Alteryx is costly compared to the other tools on this list. Moreover, there is still a bit of a learning curve, even if you’re experienced in data analytics. You should also check that Alteryx fits the way your teams collaborate.

How Datacoves can help you transform data

Data transformation is a process that’s prone to multiple errors along the way. While many tools listed can help you reduce friction, they must be carefully evaluated. With Datacoves, you’ll be able to implement best data practices and DataOps so that you have a smooth process with a minimized learning curve. 

If you’d like to learn more about how Datacoves helps you accelerate time to value, you can schedule a free demo here.

Transformative power of data modeling in data
5 mins read

I read an article on Anchor Data Modeling, and more specifically Atomic Modeling, in which the author proposes a different way of modeling data. The rationale for this change is that there is a lack of skills to model data well. We are giving powerful tools to novices, and that is bound to lead to problems.

From the article:

"we are in a distressful situation both concerning the art as a whole but also its place in modern architectures"

Is this the case? Do we have a big problem on the horizon that requires us to make this big shift?

I'd say I am open-minded and expose myself to different ways of thinking so I can broaden my views. A few years ago, I learned a bit about COBOL, not because I had any real use for it but because I was curious. I found it very interesting and even saw its similarities with SQL. I approached the topic with no preconceived ideas; this is the first time I heard of Atomic Modeling.

Is there something wrong with atomic and anchor data modeling in general?

The issues I see with ideas like Atomic data modeling are not in their goal. I am 100% aligned with the goal; the problem is the technology, process, and people needed to get there.

What we see in the market is a direct result of a backlash against doing things "perfectly." But why is this the case?  I believe it is because we haven't communicated how we will achieve this vision of ideas like atomic data. The author even says a key phrase in the first paragraph:

"practitioners shying away from its complexity"

If doing anchor data modeling is "complex," how are we going to up-skill people? Is this feasible? I am happy if I can get more people to use SQL vs a GUI tool like Alteryx 😁

Doing anchor data modeling is complex

I am by no means an expert. Yet, I am fairly technical, and if I am not convinced, how will we convince decision-makers?

As I read this article, here's what I am thinking:

1. First, I will need to deconstruct the data I get from some source, like material master data from SAP. That will be a bunch of tables, and who is going to do all this data modeling? It sounds expensive and time-consuming.

2. I am going to need some tooling for this, and I am either going to build it or use something a few others are using. Will my company want to take a chance on something this early? This sounds risky.

3. After I deconstruct all this data, I need to catalog all these atoms. I now have a lot of metadata, and that's good, but is the metadata business-friendly? We can't get people to add table descriptions today, so how is this going to happen with this explosion of objects? Who will maintain it? How will we expose this? Is there a catalog for it already? Does that catalog have the other features people need? It sounds like I need to figure out a bunch of things, the biggest one being the change management aspect.

4. What sort of database will I use to store this? This is a great use case for a graph database. But graph databases are not widely adopted, and I have seen graph databases choke at scale. We can use a relational database, but these joins are going to be complex.  Someone may have figured all this out, but there's more tech and learning needed. It sounds like this will also add time and expense.

5. When I have managed to do all the above, I will need to construct what people can actually use. We need objects that work with tools that are available. I need to make relational tables I can query with SQL and viz tools, which are more user-friendly. This sounds like more work, more time, and more money.
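To make steps 1 and 5 concrete, here is a minimal sketch of what deconstructing and reconstructing data could look like. It is only loosely in the spirit of anchor modeling: real implementations also deal with relationships between entities, history, and much more, none of which is shown here. The `materials` records, the column names, and both helper functions are hypothetical and do not come from the article or from SAP.

```python
from collections import defaultdict

# Hypothetical wide records, loosely resembling material master data.
materials = [
    {"material_id": "M-001", "description": "Steel bolt", "base_unit": "EA", "weight_kg": 0.02},
    {"material_id": "M-002", "description": "Copper wire", "base_unit": "M", "weight_kg": None},
]


def deconstruct(rows, key):
    """Step 1: split wide rows into a key list plus one narrow table per attribute."""
    anchors = [row[key] for row in rows]
    attributes = defaultdict(dict)
    for row in rows:
        for column, value in row.items():
            # Missing values are simply not stored in the narrow tables.
            if column != key and value is not None:
                attributes[column][row[key]] = value
    return anchors, dict(attributes)


def reconstruct(anchors, attributes, key):
    """Step 5: join the narrow tables back into wide rows that people can query."""
    return [
        {key: anchor, **{name: values.get(anchor) for name, values in attributes.items()}}
        for anchor in anchors
    ]


anchors, attributes = deconstruct(materials, "material_id")
wide = reconstruct(anchors, attributes, "material_id")
```

Even in this toy version, one four-column table becomes three attribute tables plus a key list, and every one of those objects would need a description, an owner, and a join path back to something usable.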

I may have missed some steps and oversimplified what's needed for this type of change. I am also aware that I may not know what exists to solve all the above. However, if I don't know it, then there are a ton of other people who also don't know it and this is where we need to start. We need to understand how we will tactically achieve this "better" world.

How we tactically achieve a better world

What are we fixing?

I've had conversations on metadata-driven automation, and like atomic modeling, I am not clear on who we are helping and how. What are we improving and in what timeframe? In the end, it feels like we have optimized for something only a few companies can do. To do anchor modeling well would be a huge expense, and when things go wrong, there are several points of failure. When we look at business problems, we need to be sure to optimize the end-to-end system. We can't locally optimize one area because we are likely moving the problem somewhere else. This can be in terms of time, money, or usability.

Decision-makers are not interested in data modeling. They are expecting results and a faster time to market. It's hard enough getting people to do things "better." This is why I find it hard to imagine that we can get to this level of maturity any time soon.

What can we do instead?

There are incremental steps we can take to incorporate best practices into the modern data stack. We need to help people mature their data practice faster, and we should not let perfection get in the way of good. Most companies are not large enterprises with millions of dollars to spend on initiatives like atomic modeling. That being said, I have yet to see anchor modeling in practice, so I welcome the opportunity to learn. I remember years ago the debates about how Ruby on Rails was teaching people "bad practices." The other side of that argument is that Rails helped companies like Twitter and GitHub launch businesses faster. Rails was also better than the alternative at the time, which included messy PHP code. Others advocated for well-crafted, "scalable," and expensive Java applications. Rails may not be the powerhouse it once was, but it has had a huge influence on how we build software. I even see its influence in dbt, even if it might not have been intentional or direct.

Conclusion

Tools like Snowflake and dbt allow us to build processes that are much better than what most people have. Should we focus on all the "bad" things that may come with the modern data stack? Should we focus on how practitioners are not well educated, and so we need to throw all they are doing out?

I don't think so; I believe that we can help companies mature their data practices faster. Will we have the best data models? Maybe not. Will users do things perfectly? Nope. But can we help them move faster and guide them along their journey to avoid big pitfalls? I think we can. Getting people to use git, automating testing, and creating DataOps processes is a huge step forward for many organizations. Let's start there.

There's a reason Data Mesh and the Modern Data Stack resonate with so many people. There's a desire to do things faster with more autonomy at many companies, not just the ones with multi-million-dollar budgets. Let's focus on what is achievable, do the best we can, and help people mature along the way. We don't need more complexity; we need less.

DataOps top tools
5 mins read

In continuation of our previous blog discussing the importance of implementing DataOps, we now turn our attention to the tools that can efficiently streamline your processes. Additionally, we will explore real-life examples of successful implementations, illustrating the tangible benefits of adopting DataOps practices.

Which DataOps tools can help streamline your processes?

There are a lot of DataOps tools that can help you automate data processes, manage data pipelines, and ensure the quality of your data. These tools can help data teams work faster, make fewer mistakes, and deliver data products more quickly.

Here are some recommended tools needed for a robust DataOps process:

  1. dbt (data build tool): dbt is an open-source data transformation tool that lets teams use SQL to transform the data in their warehouse. dbt has a lightweight modeling layer and features like dependency management, testing, and the ability to create documentation. Since dbt uses version-controlled code, it is easy to see changes to data transformations (code reviews) before they are put into production. dbt can dynamically change the target of a query's "FROM" clause, and this allows us to run the same code against development, test, and production databases by just changing a configuration. During the CI process, dbt also lets us run only the changed transformations and their downstream dependencies.
  2. Fivetran: Fivetran is an Extract and Load (EL) tool that has been gaining in popularity in recent years since it removes the complexity of writing and maintaining custom scripts to extract data from SaaS tools like Salesforce and Google Analytics. By automating the extraction of data from hundreds of sources, Fivetran frees data engineers to work on projects with a bigger impact. Finally, Fivetran has a robust API which allows you to save configurations done via its UI for disaster recovery or to promote configurations from a development to a production environment.
  3. Airbyte: Airbyte is another data-ingestion EL tool that is appealing because it is open source, requires little or no code, and is run by the community. It also makes it easier to extract and load data without having to do custom coding.  Airbyte also offers a connector development kit to help companies build custom connectors that may not be available. This allows companies to leverage most of the Airbyte functionality without too much work. There's also an API that can be used to retrieve configurations for disaster recovery.
  4. SQLFluff: SQLFluff is an open-source SQL linter that helps teams make sure their SQL code is consistent and follows best practices. It gives you a set of configurable rules to find and fix syntax errors, inconsistencies in style, and other common mistakes. SQLFluff can be added to the CI/CD pipeline so that problems are automatically found before they are added to the codebase. By using a tool like SQLFluff, you can make sure your team follows a consistent coding style, and this will help with long-term project maintainability.
  5. dbt-checkpoint: dbt-checkpoint provides validators that check the quality of your dbt projects. dbt is great, but when a project has a lot of models, sources, and macros, it gets hard for all the data analysts and analytics engineers to maintain the same level of quality. Users forget to update the columns in property (yml) files or add descriptions for the table and columns. Without automation, reviewers have to do more work and may miss unintentional mistakes. Organizations can add automated validations with dbt-checkpoint, which makes the code review and release process better.
  6. Hashboard: Hashboard is a business intelligence (BI) product built for data engineers to do their best work and easily spread the data love to their entire organizations. Hashboard has an interactive data exploration tool that enables anyone in an organization to discover actionable insights.
  7. GitHub: GitHub offers a cloud-based Git repository hosting service. It makes it easier for people and teams to use Git for version control and to work together. GitHub can also run the workflows needed for CI/CD and it provides a simple UI for teams to perform code reviews and allows for approvals before code is moved to production.
  8. Docker: Docker makes it easy for data teams to manage dependencies, such as the versions of libraries like dbt, dbt-checkpoint, and SQLFluff. Docker makes development workflows more robust by integrating the development pipeline and packaging dependencies together, which simplifies reproducibility.
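
Two of the ideas in the dbt entry above can be illustrated with a small sketch: resolving the same model to a different schema per environment, and selecting only changed models plus everything downstream of them. This is not how dbt is implemented; it is a plain Python illustration, and the model names, schema names, and both functions are hypothetical.

```python
from collections import deque

# Hypothetical project: each model maps to the models it selects from.
DEPENDS_ON = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "customer_ltv": ["orders_enriched"],
    "daily_revenue": ["stg_orders"],
}

# Hypothetical mapping from environment (target) to schema.
TARGET_SCHEMAS = {"dev": "analytics_dev", "prod": "analytics"}


def qualified_name(model, target):
    """Resolve a model to a target-specific relation, so the same code runs anywhere."""
    return f"{TARGET_SCHEMAS[target]}.{model}"


def changed_and_downstream(depends_on, changed):
    """Return the changed models plus every model that depends on them."""
    # Invert the graph: parent -> list of direct children.
    children = {model: [] for model in depends_on}
    for model, parents in depends_on.items():
        for parent in parents:
            children[parent].append(model)

    # Breadth-first walk from the changed models through their children.
    selected = set(changed)
    queue = deque(changed)
    while queue:
        for child in children[queue.popleft()]:
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


to_run = changed_and_downstream(DEPENDS_ON, ["stg_customers"])
```

In this example a change to `stg_customers` selects `orders_enriched` and `customer_ltv` but leaves `stg_orders` and `daily_revenue` alone, which is what keeps CI runs short on large projects.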

Examples of companies who have successfully implemented DataOps

DataOps has been successfully used in the real world by companies of all sizes, from small startups to large corporations. The DataOps methodology is based on collaboration, automation, and monitoring throughout the entire data lifecycle, from collecting data to using it. Organizations can get insights faster, be more productive, and improve the quality of their data. DataOps has been used successfully in many industries, including finance, healthcare, retail, and technology.

Here are a few examples of real-world organizations that have used DataOps well:

  1. Optum: Optum is part of UnitedHealth Group. Optum prioritizes healthcare data management and analytics, and when they wanted to implement new features and apps quickly, they turned to DataOps. DataOps helped Optum break down silos, saving millions of dollars annually by reducing compute usage. Optum's biggest challenge was managing data from dozens of sources via thousands of APIs. A massive standardization and modernization effort created a scalable, centralized data platform that seamlessly shared information across multiple consumers.
  2. JetBlue: DataOps helped JetBlue make data-driven decisions. After struggling with an on-premises data warehouse, the airline migrated to the cloud to enable self-service reporting and machine learning. They've cleaned, organized, and standardized their data and leveraged DataOps to create robust processes. Their agility in data curation has enabled them to increase data science initiatives.
  3. HubSpot: HubSpot is a leading company that makes software for inbound marketing and sales. It used DataOps to improve the use of its data. By using a DataOps approach, HubSpot was empowered to do data modeling the right way, to define model dependencies, and to update and troubleshoot models, which resulted in a highly scalable database and opened up new data application possibilities.
  4. Nasdaq: Nasdaq, a global technology company that provides trading, clearing, and exchange technology, adopted DataOps to improve its data processing and analysis capabilities. They launched a data warehouse, products, and marketplaces quickly. After scalability issues, they moved to a data lake and optimized their data infrastructure 6x. The migration reduced maintenance costs and enabled analytics, ETL, reporting, and data visualization. This enabled better and faster business opportunity analysis.  
  5. Monzo: Monzo is a UK-based digital bank that used DataOps to create a data-driven culture and improve its customer experience. By letting everyone make and look at the different data maps, they are helping teams figure out how their changes affect the different levels of their data warehouse. This gave the Monzo data team confidence that the data they give to end users is correct.

What is the future of DataOps adoption?

DataOps has a bright future because more and more businesses are realizing how important data is to their success. With the exponential growth of data, it is becoming more and more important for organizations to manage it well. DataOps will likely be used by more and more companies as they try to streamline their data management processes and cut costs. Cloud-based data management platforms have made it easier for organizations to manage their data well. Some of the main benefits of these platforms are that they are scalable, flexible, and cost-effective. With DataOps, teams can improve collaboration and agility and build trust in data by creating processes that test changes before they are rolled out to production.

With the development of modern data tools, companies can now adopt software development best practices in analytics. In today’s fast-paced world, it's important to give teams the tools they need to respond quickly to changes in the market by using high-quality data. Companies should use DataOps if they want to manage data better and reduce the technical debt created by uncontrolled processes. Putting DataOps processes in place for the first time can be hard, and it's easier said than done. DataOps requires a change in attitude, a willingness to try out new technologies and ways of doing things, and a commitment to continuous improvement. If an organization is serious about using DataOps, it must invest in the training, infrastructure, and cultural changes that are needed to make it work. With the right approach, companies can get the most out of DataOps and help their businesses deliver better outcomes.

At Datacoves, we offer a suite of DataOps tools to help organizations implement DataOps quickly and efficiently. We enable organizations to start automating simple processes and gradually build out more complex ones as their needs evolve. Our team has extensive experience guiding organizations through the DataOps implementation process.

Schedule a call with us, and we'll explain how dbt and DataOps can help you mature your data processes.