The case for a data platform, and how to select one

Published: October 11, 2022

Data is the new oil, as the saying goes. Increasing digitization across all areas of the economy has made it easier to understand customers, discover unmet market needs, and optimize business workflows.

Scalable digitization, though, requires moving beyond spreadsheets and turning to data platforms. This article gives you an in-depth perspective on how we look at data platforms, a clear understanding of the whys, and some pointers to emerging industry standards.

Do I need a data platform?

Digitization presupposes tabular data — think of your financial records but imagine keeping your customer support interactions and marketing impressions in a similar structure. All tabular data can be handled using the same tools that have traditionally been used for financial analysis: Excel, databases, or Business Intelligence (BI) applications.

For the sake of this article, let's imagine that your organization is focused on increasing sales by offering discounts. You're hoping to leverage data (tabular, of course) to understand if your discount program is actually working. If you want to calculate total profit, you would need at least two columns: price and discount. However, determining the true success of the program likely depends on evaluating a variety of factors that go beyond total profit—and that's difficult to do in a simple spreadsheet. Understanding why will make the need for a data platform clear.

Getting the right data to avoid flawed assumptions

When you don’t have all the right data, it’s easy to make incorrect decisions based on flawed assumptions—like assuming that all items make the same profit. If higher-grossing products outnumbered lower-grossing ones while the discount was running, total profits may lead you to believe that discounts drive sales up. In reality, the higher average prices are skewing the results.

Failing to consider seasonality could be another mistake. Think of Christmas: profits are likely to be higher than during other parts of the year. Discounts during Christmas compared to other buying seasons could seem to work great, but in reality it’s not an accurate comparison.

Data analysis of this type depends on comparing one dimension at a time and keeping all other dimensions constant (i.e. the same items during the same periods). And, to do that, you need more than just price and discount in your spreadsheet.
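
To make this concrete, here is a minimal sketch in pandas of what such a like-for-like comparison could look like. The column names and figures are purely hypothetical, and a real analysis would also hold the period constant to control for seasonality.

```python
import pandas as pd

# Hypothetical sales records; in practice these would come from your own systems.
sales = pd.DataFrame({
    "item":     ["mug", "mug", "lamp", "lamp"],
    "period":   ["2022-Q3", "2022-Q4", "2022-Q3", "2022-Q4"],
    "price":    [10.0, 10.0, 40.0, 40.0],
    "discount": [0.00, 0.20, 0.00, 0.20],
    "units":    [100, 130, 20, 25],
})

# A simple profit proxy: discounted revenue per row.
sales["profit"] = sales["price"] * (1 - sales["discount"]) * sales["units"]

# Compare the same items with and without the discount, one dimension at a time.
comparison = sales.pivot_table(index="item", columns="discount",
                               values="profit", aggfunc="sum")
print(comparison)
```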

Adding two columns seems simple enough

If you’re like every other company Proxet works with, the necessary columns won’t all be on the same spreadsheet. Instead, the information is probably spread around multiple databases, legacy systems, Google Drive folders, data lakes, Excel spreadsheets, SharePoint folders, and other data silos!

Connecting these data silos together is tempting, especially as useful APIs and data partners keep popping up all over the Internet. But, owning and maintaining a big cluster of silos and integrations can be complex and expensive. One-off connections are hard to scale and, more importantly, inconsistent logic will make it difficult to realize any tangible value until your data silos are unified under a common information schema.

Can’t we just copy the new columns manually?

Manually copying and pasting data could work as a one-time solution, but this approach requires the team to do so every time fresh data is needed. It also introduces three new problems:

  1. Data would quickly become obsolete.
  2. Data would originate from sources with minimal or no visibility.
  3. Data would likely be riddled with an ever-growing number of errors, unless you’re comfortable assuming the perfect execution of all manual tasks. (That’s like assuming pro athletes never get injured!)

Going beyond “Copy and Paste”

Let’s start by addressing obsolete data. Why is it such a problem? Well, if your discounts stop working, you probably won’t know until somebody updates the spreadsheet again. Depending on the size and scope of the promotion, you could lose thousands of dollars per day until the next monthly meeting. Automating the flow of data through a process called data ingestion keeps your data fresh with far less manual effort.
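
As a rough sketch of what that automation can look like, the snippet below pulls the latest records from a hypothetical sales endpoint and appends them to a local database. The URL, table name, and columns are assumptions made purely for illustration.

```python
import sqlite3

import pandas as pd
import requests

# Hypothetical endpoint; a real pipeline would point at your own sources.
SALES_API = "https://example.com/api/sales?since=last_run"

def ingest_sales(db_path: str = "analytics.db") -> int:
    """Fetch the newest sales records and append them to a central table."""
    records = requests.get(SALES_API, timeout=30).json()
    df = pd.DataFrame(records)  # e.g. columns: item, price, discount, sold_at
    with sqlite3.connect(db_path) as conn:
        df.to_sql("sales", conn, if_exists="append", index=False)
    return len(df)

# Run this on a schedule (cron, Airflow, Dagster, ...) instead of by hand.
if __name__ == "__main__":
    print(f"Ingested {ingest_sales()} new rows")
```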

Next let’s talk about visibility. The number of data sources can escalate quickly as new stores, products, offers, promotions, and pricing strategies are introduced. That’s why exclusive reliance on the original data source cannot tell the entire story. After all, it is very likely that new sources have been created but are not integrated into the spreadsheet. Without proper ingestion and automation, the owner of the spreadsheet will likely remain unaware of developments by other teams or new products. And, if the very owner of the spreadsheet does not have a firm understanding of what data is in use, available, or missing, neither will the rest of the organization—hence the lack of visibility. Data observability helps solve this challenge by tracking which data sources are being used.
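
A toy version of that tracking might look like the snippet below, which logs every ingestion run and summarizes how fresh each source is. The table and source names are hypothetical; dedicated observability tools do this far more thoroughly.

```python
import sqlite3
from datetime import datetime, timezone

import pandas as pd

def record_ingestion(conn: sqlite3.Connection, source: str, row_count: int) -> None:
    """Log which source was ingested, when, and how many rows it contributed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingestion_log (source TEXT, rows INTEGER, ingested_at TEXT)"
    )
    conn.execute(
        "INSERT INTO ingestion_log VALUES (?, ?, ?)",
        (source, row_count, datetime.now(timezone.utc).isoformat()),
    )

def freshness_report(conn: sqlite3.Connection) -> pd.DataFrame:
    """Per source: last ingestion time and total rows to date."""
    return pd.read_sql(
        "SELECT source, MAX(ingested_at) AS last_ingested, SUM(rows) AS total_rows "
        "FROM ingestion_log GROUP BY source",
        conn,
    )

with sqlite3.connect("analytics.db") as conn:
    record_ingestion(conn, "web_store", 1200)
    record_ingestion(conn, "retail_pos", 300)
    print(freshness_report(conn))
```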

Analytics will only be as good as the quality of your data

Once the data is centralized and fully accounted for, quality-checking it becomes simpler and faster. Automating these processes to run at regular intervals as the data is ingested, without human intervention, makes the process even easier. No direct supervision would be required to determine if all the data attributes are within the previously observed range. That’s the basic idea behind automated data quality monitoring.
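
For instance, a bare-bones range check could look like the sketch below. The bounds shown are hypothetical; in a real setup they would be derived from previously observed data and refreshed as new data arrives.

```python
import pandas as pd

# Hypothetical bounds learned from historical data.
EXPECTED_RANGES = {
    "price":    (0.50, 500.00),
    "discount": (0.00, 0.60),
}

def check_ranges(df: pd.DataFrame) -> list[str]:
    """Return human-readable warnings for values outside their expected range."""
    warnings = []
    for column, (low, high) in EXPECTED_RANGES.items():
        bad = df[(df[column] < low) | (df[column] > high)]
        if not bad.empty:
            warnings.append(f"{len(bad)} rows have {column} outside [{low}, {high}]")
    return warnings

# This would typically run automatically after every ingestion job,
# alerting a human only when warnings are produced.
```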

Generating and sharing data visualizations and reports

Displaying the results of your analysis in a visualization tool could make it easier to explore your data and make informed decisions. Or, maybe you’d like to include your visualizations in a report, accompanied by documentation that explains what each chart means. Combining your spreadsheet with state-of-the-art BI software can make all of this possible. The final deliverable can then be exported and uploaded to your company’s cloud so that anyone can access it. But this statement hides two problematic assumptions: anyone and access.

Anyone

It is very unlikely that every person with access to the company’s Google Drive folder should also have access to the report about discount information and product pricing strategy. That’s why proper data governance—controlling access to certain information through user roles, credentials, authentication, and other means—is essential.
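
Conceptually, the simplest form of this is a mapping from roles to the datasets they may read, as in the toy sketch below. The roles and dataset names are hypothetical, and real platforms handle this with built-in governance tooling rather than application code.

```python
# Which datasets each role may read; purely illustrative values.
ROLE_PERMISSIONS = {
    "analyst":   {"sales", "inventory"},
    "marketing": {"sales"},
    "finance":   {"sales", "inventory", "pricing_strategy"},
}

def can_read(role: str, dataset: str) -> bool:
    return dataset in ROLE_PERMISSIONS.get(role, set())

assert can_read("finance", "pricing_strategy")
assert not can_read("marketing", "pricing_strategy")
```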

Access

Being able to passively read the summary report on your company’s cloud is a good start, but that excludes interaction with live data. Once your analysis is converted into a permanent report, it becomes a static snapshot that is less helpful with each passing day.

Tweaking even a single variable and re-running the analysis with updated data will require someone to manually repeat every step of the process—virtually ensuring an unrepeatable process and incomparable results. In a sense, every time you create an artifact based on the data, such as a report or a visualization, you’re also creating a small two-way data silo: the documentation is isolated from the data, and the data is isolated from the documentation.

Data catalogs help with this by mapping artifacts derived from data via metadata. Think of tags that are added automatically at the time of creation. This ensures there are no loose ends: PDF reports, JPEG charts, Word files, and wiki pages will all be automatically labeled with tags that codify the exact version of the data described in those artifacts, as well as the scripts used to process it. You can recreate previous analyses in a matter of seconds by navigating through metadata and pulling the associated artifacts. You could also modify those artifacts to include new variables, data sources, or processing.
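
In spirit, the tagging could look something like the sketch below, which writes a small metadata record next to each artifact noting when it was created, which data snapshot it used, and a hash of the script that produced it. The function, file names, and metadata fields are all hypothetical; real catalogs store this information centrally rather than in sidecar files.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def catalog_artifact(artifact: Path, data_version: str, script: Path) -> None:
    """Record enough metadata to trace an artifact back to its data and code."""
    record = {
        "artifact": artifact.name,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data_version": data_version,  # e.g. the ID of a data snapshot
        "script_sha256": hashlib.sha256(script.read_bytes()).hexdigest(),
    }
    sidecar = artifact.with_name(artifact.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))

# Example (hypothetical file names):
# catalog_artifact(Path("q4_discount_report.pdf"), "sales_snapshot_2022_10_01", Path("analysis.py"))
```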

What are my options?

There are two end-to-end data platforms on the market that we consider state of the art: Palantir Foundry and Databricks. Here’s a quick overview as you consider your options:

Palantir Foundry is an end-to-end solution providing robust capabilities for the needs discussed in this post, including a solid system of roles for data governance, the ability to share data between accounts, and a highly advanced data catalog through their Ontology tool. Historically, Palantir Foundry's pricing put it out of reach for most businesses outside the Fortune 500, but the company has lately started offering more accessible options to startups and SMBs through the Foundry for Builders program. It is still among the most expensive solutions, but you get an end-to-end, fully integrated platform with top-tier data governance and security natively built in.

Databricks, on the other hand, is a much more accessible platform. As opposed to Foundry, you can just sign up for a Databricks demo online. More importantly, the platform is modular, so you can pick and choose the modules you want and avoid paying for functionality you don’t need. However, the modularity means you’ll have to spend more engineering time on the integrations, and as you start to use more modules and process more data, the spend can quickly ramp up to Palantir levels.

Databricks is highly customizable and has a thriving ecosystem. Built on open-source technology, it benefits from a large (and growing) community of users who improve the main product daily by adding new features, writing documentation, and providing support. A highly modular design also means that any team of developers can easily extend it and build custom tooling on top of it. If your data analytics involves highly specific processing or modeling challenges that may require custom data pipelines, Databricks might actually be your best option.

Outside of these end-to-end data platforms, there are numerous providers offering specialized solutions with their own strengths, weaknesses, and focus areas. Sometimes a specialized tool is better suited to your specific use case than the general-purpose functionality of a data platform; below are some examples of the tools we consider best in class for specific use cases.

For data ingestion, Airbyte, Fivetran, and Steampipe are great options. The first two also come with ready-made connectors and templates that allow users to ingest data from popular data services faster and more easily, significantly decreasing the overhead of setting up the necessary infrastructure.

Data ingestion is not trivial when schema-less (unstructured) data is involved: business data is ultimately represented according to some schema, which means all incoming data must be mapped onto that schema before it can be effectively leveraged. Companies like Tecton, Hopsworks, and DataRobot cover both data transformation and Machine Learning operations (MLOps). dbt provides general-purpose data transformation solutions. John Snow Labs is doing some amazing work on Spark NLP, which provides specialized solutions for processing text in the healthcare, financial, and legal domains. Many other start-ups and established companies exist in this space, from task-specific providers such as speech recognition to open-domain natural language understanding providers (Primer, fastText, NLP Cloud, or Eigen Technologies). This topic alone would require an entire post, so we will leave it at that.
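
Returning to the schema-mapping point above, the toy sketch below reshapes two hypothetical sources describing the same kind of sale onto one common schema. All field names and values are made up for illustration.

```python
import pandas as pd

# Two hypothetical sources describing the same kind of sale in different shapes.
web_orders = [{"sku": "MUG-01", "amount_usd": 10.0, "promo_pct": 20}]
pos_export = [{"product_code": "MUG-01", "gross_price": 10.0, "discount": 0.2}]

COMMON_COLUMNS = ["item", "price", "discount"]

web_df = pd.DataFrame(web_orders).rename(columns={"sku": "item", "amount_usd": "price"})
web_df["discount"] = web_df["promo_pct"] / 100
web_df = web_df[COMMON_COLUMNS]

pos_df = pd.DataFrame(pos_export).rename(columns={"product_code": "item", "gross_price": "price"})
pos_df = pos_df[COMMON_COLUMNS]

# Once both sources share one schema, they can be analyzed together.
unified = pd.concat([web_df, pos_df], ignore_index=True)
print(unified)
```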

For data ingestion automation (often referred to as data orchestration), good options are Airflow (Astronomer) and Dagster.
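
As an illustration, a minimal Airflow DAG that refreshes the sales data every night could look roughly like the sketch below; the DAG, task, and function names are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_sales():
    # Placeholder for the ingestion logic sketched earlier in this post.
    ...

# A hypothetical DAG that refreshes the sales data once a day.
with DAG(
    dag_id="daily_sales_ingestion",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_sales", python_callable=ingest_sales)
```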

For data warehousing, some of the industry-leading solutions include Amazon Redshift, BigQuery by Google, and Snowflake.

In the area of data observability, two main groups emerge: one provides data cataloging and governance solutions (AtScale, Atlan, Amundsen / Stemma), while the other focuses on data quality (Monte Carlo Data, Anomalo).

Finally, in terms of data visualization and general BI capabilities, some of the key players to consider are PowerBI, Sisense, Tableau and Streamlit.
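
To give a flavor of the last item on that list, here is a minimal Streamlit sketch of an interactive report backed by live data rather than a static export. The database path, table, and columns are hypothetical.

```python
import sqlite3

import pandas as pd
import streamlit as st

st.title("Discount program overview")

# Read the latest ingested data instead of a frozen spreadsheet export.
with sqlite3.connect("analytics.db") as conn:
    sales = pd.read_sql("SELECT item, price, discount FROM sales", conn)

item = st.selectbox("Item", sorted(sales["item"].unique()))
filtered = sales[sales["item"] == item]

st.metric("Rows for this item", len(filtered))
st.bar_chart(filtered.groupby("discount")["price"].sum())
```

Launching this with the streamlit run command would give anyone with the right permissions a live view of the data instead of a PDF snapshot.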

Final thoughts

As the scale and complexity of your business increases, there is a good chance it will eventually make sense to migrate your data analytics to a dedicated platform. Sooner or later, as information sources, data volumes, and representation schemas multiply all around the Internet, spreadsheets and even simple databases will feel like pre-industrial technology in an industrialized world.

Without a data platform, you will still find what you need, but it will probably be too late to use it to your advantage. And there is little point in merely reacting to the insights in your data—data’s key value lies in prediction. Data platforms, through their built-in aggregation and automation capabilities, can help turn data into an idea assembly line. More importantly, they do so while removing the hassle of setting up databases, connectors, ETLs, CI/CD pipelines, tests, dedicated QA, as well as all the work needed to put together a talented team of on-demand professionals.

You have built a team optimized for your business case. Now, your data analytics deserves a solution built by a team optimized for crunching the numbers.
