Data Stack: The Ultimate Guide

Published: July 21, 2023

Organizations across virtually every industry are turning to data to drive growth and maximize operational efficiency. But collecting, storing, analyzing, and governing data is easier said than done. To truly succeed with data, you need the right mix of talent and technology—commonly called the “data stack.”

In this guide, we’ll share a step-by-step process for developing a data stack at different data maturity stages. It’s the same process we used to enable:

  • Executive financial reports at a top VC fund, providing reliable performance insights into 400+ portfolio companies while saving 100+ hours of manual work on report generation.
  • Data-driven decision-making at a leading FBA aggregator for 5000+ brands, impacting stakeholders across M&A, marketing, SP&A, and brand management.
  • Enhanced blacklisting and whitelisting algorithm for a major content platform, resulting in an 18% performance improvement.
We’ve broken the guide into four main parts:
  1. Data maturity
  2. Data stack components
  3. Building your data stack
  4. Avoiding common data stack mistakes

Let’s dive deeper.

Data Stack Diagram

Assess your organization’s data maturity level

Data tools seem to be everywhere. Where do you even begin? Start by assessing your organization’s data maturity level. Data maturity provides the necessary context for allocating technical effort and financial resources, and it helps you decide which parts of the data stack to focus on first.

“It’s not about the age of the company. It’s not about the size of the company. It really comes down to the business use case and whether they value data maturity. Is data just a resource, or is it an asset? A great example is Fortune 100 companies. Some are over a hundred years old. They had to go through a whole digital transformation just to get that. So, for them to move up in the maturity cycle is a lot more effort. You might actually find many that have departments that are really mature, but the company as a whole is on-prem and with really arcane code. On the other side for startups, if you’re looking at an ML company and your business value prop is, “We serve the best data” or “We have the best predictions,” by default, you’re going to need that level of data maturity to achieve that. So, it really comes down to where does data provide a competitive advantage for your business? […] I define maturity more as their ability to capture, process, and act upon data rather than their age or revenue or anything like that.”

– Mark Freeman, Founder of On the Mark Data

How can you assess data maturity? Some leaders bring in third-party firms to perform in-depth data maturity studies, but this is usually unnecessary. Taking a DIY approach is feasible, especially when you have the right framework. Below is Proxet’s framework for assessing data maturity. Feel free to use this as a template for your internal discussions.

Which “level” best describes your organization’s data maturity? Hopefully, you’re not “lost at sea”…

Data maturity level

“Lost at sea” (Level 0)

Imagine clinging to a life raft in the middle of a vast ocean. The sun’s rays beat down from above, and below is the dark abyss of the unknown. All around you is nothing but miles of non-potable water. This is how some business leaders feel when it comes to their data. On the one hand, data—like water in an ocean—seems to be everywhere: spreadsheets, operational systems, presentations, inboxes, and so on. Unfortunately, harnessing the data at scale isn’t feasible due to infrastructure issues and fragmented data strategies. Simply keeping your head above water amid crashing waves becomes the primary focus. Decisions are made based on the gut feelings of a few “experts.” Data errors are prevalent and practically impossible to overcome. It’s a scary place to be, but one that’s all too familiar for many business leaders. We consider this “level zero.”

“Rowboat” (Level 1)

Climbing into a rowboat lifts you out of the vulnerable position of being lost at sea. Now you can start moving in a specific direction instead of being completely tossed about by your surroundings. But, if you’ve ever been in a rowboat, you know that the rower faces backward, away from the direction of travel. Organizations are in a “rowboat” stage of data maturity when they’ve invested in modern data technology but most insights are about past performance. To complicate matters, one organization may have multiple leaders rowing different boats—in completely different oceans of data—instead of rowing together in a unified direction. Lots of effort, little progress.

“Sailboat” (Level 2)

Achieving a unified data fabric allows organizations to “raise the sails” and harness a more efficient source of propulsion. Aligning around fewer sources of truth empowers leaders to begin developing meaningful data strategies. Teams spend less time “rowing” and more time exploring use cases for data. Signs of the “sailboat” stage include data silo consolidation, faster time to insight, and an increased focus on what lies ahead.

“Speedboat” (Level 3)

Sailboats depend on the wind to keep moving, which is bad news when you’re stuck in the doldrums. To maintain a consistent speed, you need an engine. Getting leaders into the same boat that’s powered by a unified data engine enables cross-functional reporting and frees up even more time to answer complex questions like, “How are marketing campaigns and sales outreach impacting revenue?” Centralizing enterprise data in a single source of truth, empowering a growing number of BI users, and automating data pipeline workloads are signs you’re in the “speedboat” stage of data maturity.

“Luxury Yacht” (Level 4)

Only so many people can crowd into a speedboat at once. Harnessing real-time and unstructured data, supporting advanced data-driven use cases (customer 360 view, scenario analysis, etc.), and increasing your community of BI users requires the right blend of technology and engineering expertise. Upgrading to a “luxury yacht”—one that’s equipped with more storage capacity, computational power, and support for enhanced data governance—lets you handle more data and support more sophisticated analytics.

“Cruise Liner” (Level 5)

Cruise ships provide the most efficient means of transporting large numbers of people across the ocean. Once a cruise liner gets going, it’s hard to stop. It also sits higher up in the water—providing panoramic views of what’s ahead and behind. Organizations at the “cruise liner” phase of data maturity are actively pursuing advanced analytics, AI, and ML use cases to maximize data stack ROI.

Not sure which phase you’re in and what stack you should use? Book a free data strategy call with us to analyze your case.

Understand the key components of a modern data stack

What exactly is a data stack? Definitions vary from one expert to the next, but at Proxet we believe that a data stack includes all of the tooling necessary for an organization to successfully leverage data. Reporting systems might be top of mind for business leaders, but a modern data stack includes so much more.

Here are the primary components of a data stack. Tooling examples are included for each component; these are not endorsements of specific vendors. We’ve done our best to categorize tools appropriately, but some solutions fall into multiple categories.

Key components of a modern data stack

Data Ingestion

Data comes in a variety of formats from many different sources. Sales and marketing records from your CRM, order data from your ERP, and journal entries from your accounting software are common examples. Before you can join these data sets together and generate meaningful insights, you first have to extract them from their source systems. However, building home-grown data pipelines and troubleshooting API failures can be time-consuming for even the most capable data engineering teams.

Data ingestion tools make it easier to copy data from one place to another in a secure, robust, and repeatable way. Examples include Airbyte, Azure Data Factory, Cloud Data Fusion, Fivetran, Glue, and Stitch.
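
To make the “copy data from one place to another” idea concrete, here is a minimal Python sketch of an ingestion job. The CRM endpoint, field names, and staging file are hypothetical; managed ingestion tools like the ones above essentially productionize this pattern with scheduling, retries, schema handling, and incremental loads.

```python
import csv
import requests

# Hypothetical CRM endpoint and local staging file; swap in your real source and destination.
CRM_URL = "https://api.example-crm.com/v1/contacts"
STAGING_FILE = "crm_contacts_raw.csv"

def ingest_contacts(api_token: str) -> int:
    """Pull contact records from a (hypothetical) CRM API and land them in a staging file."""
    response = requests.get(CRM_URL, headers={"Authorization": f"Bearer {api_token}"}, timeout=30)
    response.raise_for_status()  # fail loudly so orchestration and alerting can see the problem
    records = response.json().get("contacts", [])

    if records:
        with open(STAGING_FILE, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=records[0].keys())
            writer.writeheader()
            writer.writerows(records)
    return len(records)

if __name__ == "__main__":
    print(f"Ingested {ingest_contacts(api_token='YOUR_TOKEN')} records")
```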

Data Transformation

Extracting raw data from source systems offers minimal business value unless data is transformed into something more useful. Data transformation is a key step for ensuring data is formatted into a consistent, usable structure—regardless of where it came from. This allows organizations to join data from multiple systems and ultimately power views, reports, and innovative use cases. Whether you utilize an ETL (extract, transform, load) or ELT (extract, load, transform) strategy, data transformation requires considerable coding to ensure everything happens as expected.

Data transformation tools help engineers efficiently write and deploy their code. Examples include Apache Spark, AWS Glue, Data Analysis Expressions (DAX), and DBT.
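
As a simplified illustration of the “T” in ETL/ELT, the pandas sketch below takes raw order records and reshapes them into a consistent, analysis-ready structure. The column names and rules are hypothetical; in production, this kind of logic typically lives in dbt models or Spark jobs rather than a standalone script.

```python
import pandas as pd

def transform_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Standardize raw order records into a consistent structure."""
    orders = raw.copy()

    # Normalize column names coming from different source systems (hypothetical mapping).
    orders = orders.rename(columns={"Order_ID": "order_id", "Amt": "amount_usd", "Created": "created_at"})

    # Enforce consistent types so downstream joins and aggregations behave predictably.
    orders["order_id"] = orders["order_id"].astype(str)
    orders["amount_usd"] = pd.to_numeric(orders["amount_usd"], errors="coerce")
    orders["created_at"] = pd.to_datetime(orders["created_at"], errors="coerce", utc=True)

    # Drop exact duplicates and rows missing the business key.
    return orders.drop_duplicates().dropna(subset=["order_id"])

raw = pd.DataFrame({
    "Order_ID": [101, 101, 102],
    "Amt": ["19.99", "19.99", "n/a"],
    "Created": ["2023-07-01", "2023-07-01", "2023-07-02"],
})
print(transform_orders(raw))
```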

Data Modeling

At Proxet, we consider data transformation to be a subset of data modeling. However, many people use the phrase “data modeling” to describe the process of creating new data sets—for data products, data marts, and other applications—from data sets that already exist. Data modeling is also important for ensuring that data is clean and reliable. Generally speaking, data modeling tools are also used for data transformation and vice versa.

Examples of data modeling tools include SQL, DBT, and SqlDBM.
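
To show what “creating new data sets from data sets that already exist” can look like, here is a small, hypothetical sketch that derives a simple dimensional model (a customer dimension plus an order fact table) from a cleaned orders table. Real implementations would usually express this as SQL or dbt models inside the warehouse; pandas is used here only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical cleaned orders table (the output of the transformation step).
orders = pd.DataFrame({
    "order_id": ["101", "102", "103"],
    "customer_email": ["a@example.com", "b@example.com", "a@example.com"],
    "customer_name": ["Ada", "Ben", "Ada"],
    "amount_usd": [19.99, 5.00, 42.50],
})

# Customer dimension: one row per customer, with a surrogate key.
dim_customer = (
    orders[["customer_email", "customer_name"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .rename_axis("customer_key")
    .reset_index()
)

# Order fact table: measures plus a foreign key into the dimension.
fact_orders = orders.merge(dim_customer, on=["customer_email", "customer_name"])[
    ["order_id", "customer_key", "amount_usd"]
]

print(dim_customer)
print(fact_orders)
```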

Solution Spotlight:

SqlDBM allows distributed teams to have a centralized and version-controlled logical/physical data modeling environment. The platform enables each user to create, manage and deploy enterprise data models with clarity, accuracy, and speed.

Data Orchestration

Complex data pipelines have a lot of moving parts. Ensuring that data is properly ingested, transformed, modeled, and used by downstream applications can be too much for one data engineering team to handle.

Data orchestration tools help engineers define what should happen (and when), ensuring that various data pipeline components interact in a predictable, efficient way. Examples include Apache Airflow, Cloud Composer, Dagster, MWAA, and Prefect.
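
For a sense of what “defining what should happen (and when)” looks like in practice, here is a minimal Apache Airflow sketch. It assumes Airflow 2.4+ and uses placeholder task logic and a hypothetical DAG name; a real pipeline would call your actual ingestion, transformation, and publishing code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables; in a real pipeline these would invoke your ingestion,
# transformation, and publishing logic.
def ingest():
    print("extract data from source systems")

def transform():
    print("run transformations / models")

def publish():
    print("refresh dashboards or sync data to downstream tools")

with DAG(
    dag_id="daily_analytics_pipeline",  # hypothetical pipeline name
    start_date=datetime(2023, 7, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    publish_task = PythonOperator(task_id="publish", python_callable=publish)

    # Declare the order in which steps should run.
    ingest_task >> transform_task >> publish_task
```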

Solution Spotlight:

Implementing Dagster helped StudioLabs achieve a flexible system for aggregating, normalizing, and enriching customer data to improve pipeline velocity. With Proxet’s help, StudioLabs streamlined its transformation and orchestration processes, resulting in less complexity and greater scalability.

Data Storage

Consolidating data from multiple systems into a single database reduces data silos and enables new analytical, BI, and predictive use cases. Some organizations utilize a data warehouse strategy, which aligns nicely with the need for enterprise reporting, visualizations, and ML/AI use cases. Others opt for data lakes, which tend to be less structured and may include a wider diversity of data types: images, videos, audio files, tables, etc. A “lakehouse” architecture is a hybrid of the data warehouse and data lake strategies, combining the flexibility of data lakes with the stronger reporting capabilities of a warehouse.

Selecting the right data storage solution for your organization’s “single source of truth” depends on a number of factors, including your existing infrastructure, budgetary constraints, current and future use cases, data governance requirements, and technical expertise. Examples include Amazon Redshift, Azure Synapse Analytics, BigQuery (by Google), Mozart Data, and Snowflake. Integrated platforms like Databricks and Palantir Foundry also offer data storage solutions.

Data Visualization & Business Intelligence

Data visualization and BI analytics are top reasons why many organizations centralize their data into a data warehouse (or lakehouse). Sometimes casually referred to as “reports and dashboards,” an organization’s data visualization layer presents information in an intuitive way—usually in the form of charts, graphs, and data tables—so users and teams can understand performance, identify trends, and make informed decisions.

Connecting a data visualization tool to your data storage layer encourages data democratization with less manual effort. Examples include Looker, Microsoft Power BI, and Tableau.

Solution Spotlight:

Hometap needed to empower its marketing, financial, risk, and investor relations teams with real-time, trustworthy data. Proxet partnered with Hometap to build Tableau dashboards on top of Snowflake that integrated a variety of macroeconomic data to support real-time risk and return portfolio reporting, saving hundreds of hours per month.

Reverse ETL

If data visualization and BI tools make data consumable for humans, you can think of reverse ETL as a way to make data consumable by software. Reports are good, but end users might want to see the data in systems they use every day (such as Salesforce or NetSuite) or use it to inform marketing automation campaigns, optimize programmatic ads, and increase operational efficiency. Reverse ETL tools also standardize integrations to CRMs and ERPs and take care of data scalability and deliverability, allowing data teams to concentrate on the warehouse-based transformations rather than maintaining custom connectors.

Tools to consider include Census, Grouparoo (recently acquired by Airbyte), Hightouch, and Twilio Segment. Palantir Foundry also provides a built-in capability to access the semantic layer, datasets, and models (jointly called the ontology) via REST APIs.
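
A bare-bones illustration of the reverse ETL idea: read an aggregate computed in the warehouse and push it into an operational tool’s REST API. The endpoint, table, and field names below are hypothetical, and SQLite stands in for the warehouse connection; dedicated reverse ETL tools add the scheduling, batching, retries, and field mapping that this sketch leaves out.

```python
import sqlite3  # local stand-in for a warehouse connection (Snowflake, BigQuery, etc.)
import requests

CRM_API = "https://api.example-crm.com/v1/accounts/{account_id}"  # hypothetical CRM endpoint

def sync_account_scores(warehouse_path: str, api_token: str) -> None:
    """Read warehouse-computed account health scores and push them into the CRM."""
    conn = sqlite3.connect(warehouse_path)
    rows = conn.execute("SELECT account_id, health_score FROM account_scores").fetchall()
    conn.close()

    headers = {"Authorization": f"Bearer {api_token}"}
    for account_id, health_score in rows:
        # One request per record keeps the example simple; real tools batch and retry.
        response = requests.patch(
            CRM_API.format(account_id=account_id),
            json={"health_score": health_score},
            headers=headers,
            timeout=30,
        )
        response.raise_for_status()
```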

Data Versioning

Some data platforms, such as Palantir Foundry, offer built-in version control capabilities, allowing you to see how data has changed over time. Being able to “go back in time” and, in some cases, roll back changes is a nice feature to have—especially when something goes wrong.

Delta Lake, DVC, Hudi, and Pachyderm offer data versioning functionality for additional visibility and control.
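
As an example of what “going back in time” can look like, here is a minimal PySpark sketch using Delta Lake’s time travel. It assumes the delta-spark package is installed, the table already has multiple versions, and the table path is hypothetical.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a Spark session for Delta Lake (per the delta-spark quickstart).
builder = (
    SparkSession.builder.appName("time-travel-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

TABLE_PATH = "s3://analytics-bucket/tables/orders"  # hypothetical Delta table location

# Current state of the table.
current_df = spark.read.format("delta").load(TABLE_PATH)

# The same table as it existed at version 5 (or use timestampAsOf for a point in time).
v5_df = spark.read.format("delta").option("versionAsOf", 5).load(TABLE_PATH)

# Compare row counts to see how much the table has changed since that version.
print(current_df.count(), v5_df.count())
```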

Data Cataloging & Observability

Yes, we realize that data cataloging and data observability are two separate concepts. Data catalogs help companies organize their data for discoverability and governance. Data observability is about proactively identifying data quality issues, data anomalies, costly workloads, and other data-related problems.

All of that being said, there’s a lot of overlap between data catalogs and data observability solutions. Some observability tools perform similar functions to data catalogs, while some data catalogs integrate observability concepts into their interfaces. Examples include Acceldata, Amundsen, Anomalo, Atlan, AtScale, Castor, Collibra, Dataplex, Monte Carlo, Purview, Select Star, and Stemma.
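
To make the observability idea more tangible, here is a toy, hand-rolled freshness and null-rate check over a pandas DataFrame. The table, thresholds, and column names are hypothetical; dedicated observability platforms run checks like these continuously across your warehouse and alert you when something drifts.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

def check_table_health(df: pd.DataFrame, timestamp_col: str,
                       max_staleness_hours: int = 24, max_null_rate: float = 0.05) -> list:
    """Return human-readable warnings about freshness and column null rates."""
    warnings = []

    # Freshness: has new data arrived recently?
    latest = pd.to_datetime(df[timestamp_col], utc=True).max()
    if datetime.now(timezone.utc) - latest > timedelta(hours=max_staleness_hours):
        warnings.append(f"Stale data: newest {timestamp_col} is {latest}")

    # Completeness: are any columns unexpectedly empty?
    for column, rate in df.isna().mean().items():
        if rate > max_null_rate:
            warnings.append(f"High null rate in '{column}': {rate:.0%}")
    return warnings

orders = pd.DataFrame({
    "order_id": ["101", "102", None],
    "created_at": ["2023-07-01T10:00:00Z", "2023-07-02T09:30:00Z", "2023-07-02T11:15:00Z"],
})
print(check_table_health(orders, timestamp_col="created_at"))
```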

Solution Spotlight:
“Data catalogs help by mapping artifacts derived from data via metadata. Think of tags that are added automatically at the time of creation. This ensures there are no loose ends: PDF reports, JPEG charts, any Word file or wiki page will all be automatically labeled with tags that codify the exact version of the data described in those artifacts, as well as the scripts used to process it. You can recreate previous analyses in a matter of seconds by navigating through metadata and pulling the associated artifacts. You could also modify those artifacts to include new variables, data sources, or processing. Organizations use Castor to automate blueprints of their data infrastructure that drives value from day one and evolves as they do. The biggest differentiator in their approach is that they are building a catalog that can be used by anyone (data, marketing, sales) and not just by data people.”

– Louise de Leyritz, Content Lead at CastorDoc

Data Correction

Ensuring data cleanliness isn’t easy. Looking upstream to fix data quality problems is one option, but adjusting source systems can have unintended consequences for other users.

Data correction tools like Cleanlab, DataPrep, and Glue DataBrew aim to solve this problem and are an increasingly important component of the modern data stack.
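
Here is a generic, vendor-agnostic pandas sketch of what automated correction rules can look like: standardizing messy categories, clipping implausible values, and imputing obvious gaps while flagging the rows that were touched. The column names and rules are hypothetical; the tools above apply far more sophisticated, often ML-driven, versions of this idea.

```python
import numpy as np
import pandas as pd

def correct_customer_records(df: pd.DataFrame) -> pd.DataFrame:
    """Apply simple, rule-based corrections to a hypothetical customer table."""
    fixed = df.copy()

    # Standardize free-text country values into a small canonical set.
    country_map = {"usa": "US", "u.s.": "US", "united states": "US", "uk": "GB"}
    fixed["country"] = (
        fixed["country"].str.strip().str.lower().map(country_map).fillna(fixed["country"])
    )

    # Clip implausible ages rather than silently trusting bad source data.
    fixed["age"] = fixed["age"].clip(lower=0, upper=120)

    # Impute missing lifetime value with 0 and flag the rows we touched.
    fixed["ltv_imputed"] = fixed["lifetime_value"].isna()
    fixed["lifetime_value"] = fixed["lifetime_value"].fillna(0.0)
    return fixed

customers = pd.DataFrame({
    "country": ["USA", "u.s.", "Germany"],
    "age": [34, -1, 200],
    "lifetime_value": [120.0, np.nan, 87.5],
})
print(correct_customer_records(customers))
```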

Solution Spotlight:
“Cleanlab Studio uses advanced algorithms out of MIT called confident learning to systematically improve your dataset with less effort by automatically finding and fixing issues in most types of real-world data (image, text, tabular, audio, etc.).”

– Christopher Mauck, Data Scientist at Cleanlab

Artificial Intelligence and Machine Learning

No data stack guide would be complete without at least mentioning artificial intelligence and machine learning (AI & ML). With AI & ML, organizations aim to go beyond descriptive insights and move into predictive and prescriptive analytics.

At Proxet, we believe that AI and ML are less about tooling and more about building a proven process. Here’s a brief overview of our three-step process:

Step 1: Start with high-quality data

Garbage in, garbage out. That’s definitely true when it comes to ML. Before you invest any effort experimenting with machine learning models or platforms, be sure that you’ve implemented controls to ensure high-quality data.

Step 2: Define your ML goal(s)

Technology is rarely the biggest challenge for organizations that are attempting to leverage ML. The bigger issue goes back to unclear goals and missing data. Using large language models (LLMs) to make data more interactive, for example, will require a different approach—and different data—than building ML models for inventory forecasting. Be sure to set clear ML objectives that are backed by the right data.

Step 3: Select and use the right tooling

Armed with transparent ML goals and high-quality, relevant data, you’re ready to begin testing different tools. Solutions like Amazon SageMaker, DataRobot, MLflow, and Vertex AI (by Google) are a few examples to consider. We recommend starting with pre-trained models and trying a couple of off-the-shelf solutions. Evaluate whether they deliver value before making additional investments in AI/ML engineering. A minimal baseline (see the sketch below) gives you a benchmark for that evaluation.
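
As one way to put the “start simple, then evaluate” advice into practice, here is a deliberately small scikit-learn sketch for the inventory forecasting example from step 2, fitting a baseline model on hypothetical weekly demand data. If a baseline like this (or an off-the-shelf forecasting service) already delivers acceptable accuracy, you have a benchmark against which to justify further AI/ML investment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical weekly demand for one SKU; in practice this would come from your warehouse.
weeks = np.arange(1, 21).reshape(-1, 1)
units_sold = np.array([50, 52, 55, 53, 58, 60, 62, 61, 65, 67,
                       70, 69, 73, 75, 78, 77, 80, 83, 85, 88])

# Train on the first 16 weeks; hold out the last 4 to sanity-check the baseline.
model = LinearRegression().fit(weeks[:16], units_sold[:16])
predictions = model.predict(weeks[16:])

print("Held-out MAE:", mean_absolute_error(units_sold[16:], predictions))
print("Next week forecast:", model.predict([[21]])[0])
```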

Start building your data stack

You’ve carefully assessed your organization’s data maturity level and familiarized yourself with modern data technology. Now you’re ready to begin building your data stack. At Proxet, we recommend taking a product mindset that’s focused on achieving rapid business value—beginning with an in-depth discovery process.

Discovery

Who are the internal and external stakeholders that will rely on your data stack? Do they currently utilize data to perform their jobs? If so, what types of data and where does it come from? What are users’ biggest frustrations with your current data infrastructure? Studying users’ behavior and their needs is an essential step for delivering tangible value with data—instead of building a “black box” that no one really cares about.

Selection Criteria

Don’t let vendors tell you what you need (or don’t need). Instead, create a vision document that outlines your data stack objectives based on the discovery process. Include any known technical requirements, such as specific types of visualizations or data pipeline workflows. Keep ease of use and customization in the forefront when formulating your selection criteria.

Build Vs. Buy

Opting for an “off-the-shelf” solution is usually the most efficient and effective approach. However, one system rarely fulfills all of an organization’s data needs. So, when is it smarter to build—rather than buy? Our friend Chad Sanderson, Chief Operator of Data Quality Camp and Scout at Sequoia Capital, offers this advice:

“If we have a need for our data to look a certain way or have certain requirements but nothing on the market could give us at least 80% of what we wanted, then we build. As soon as something comes on the market that satisfies all those requirements, we just buy it and we get rid of the thing that we built.”

– Chad Sanderson, Chief Operator of Data Quality Camp and Scout at Sequoia Capital

Proprietary Vs. Open Source

Many “off-the-shelf” solutions are proprietary technology, which may mean less engineering time spent on updates, support, and bug fixes. However, you won’t be able to access the source code or modify it to fit your exact requirements. Open-source solutions, on the other hand, make their code publicly available. Developers can access, modify, and distribute the software, which can mean more control and flexibility.

Avoid these common mistakes while building your data stack

Data stacks have many potential points of failure. Stay on the lookout for these common mistakes as your organization builds and scales its data stack.

Misalignment with Business Goals

Your business goals should dictate the use of data (and the data stack), not the other way around. Unfortunately, some organizations forget to map their data strategies back to business processes—resulting in a data stack that provides minimal value.

Underinvesting in data quality

Reporting errors can quickly erode stakeholders’ trust in data. Rebuilding trust isn’t easy, which is why it’s better to invest early and often in data quality. That might require having uncomfortable conversations with leaders who seem less than motivated to build a best-in-class data stack. It might also mean moving slower than you—and your boss—would like.

Balancing the need for data quality and time to market is challenging, especially for startups. Startups often give preference to releasing new features rather than “cleaning up” data to support analytical, predictive, and other data-driven use cases. That might make sense from a revenue standpoint, but Mark Freeman, Founder of On the Mark Data, believes this approach is fraught with complexity and difficult to unravel:

“You get these two worlds where the database side is doing transactions for your products and the analytics side is trying to replicate logic. With that replication, you’re going to have a bunch of assumptions and mistakes. That’s going to cause a lot of data quality issues.”

– Mark Freeman, Founder of On the Mark Data

Overlooking data governance

“Let’s throw everything into the data lake and let anyone access it.” This approach sounds good in theory, but how can your organization guarantee data security and governance without implementing proper controls? Do front-line team members really need access to sensitive accounting data? Should sales reps be able to view colleagues’ performance review metrics? For most organizations, the answer is a resounding “no.” Aligning organizational governance standards with data stack capabilities is a critical step.

Failing to standardize

Data stack success depends on a willingness to make compromises for the greater good of the organization. Allowing three departments to use different BI tools might seem like the path of least resistance, politically speaking, but it’s the least desirable option from a technology standpoint. Relying on multiple tools to solve the same problem creates scalability issues and consumes engineering resources, making it harder to solve actual business problems with data.

Need help building your data stack?

To recap, building a data stack involves more than just picking the right software vendor. It starts with carefully assessing your organization’s data maturity and gaining a clear understanding of the various components—from data ingestion to ML and everything in between. Following industry best practices and remaining vigilant for inevitable challenges along the way are also key steps.

Our experienced data engineering team at Proxet is ready to support your data project. Whether you need help assessing your data maturity, evaluating vendors, or building a best-in-class data stack, we can help.

Contact Proxet for a consultation.
