Historically, running a business meant having an office and employees on premises. Today, a physical presence is no longer necessary, but a digital presence certainly is. Achieving business goals such as revenue growth, customer satisfaction, and competitive advantage depends on the ability to leverage data. Around 75% of businesses have invested in data analysis tools, and most have had data analytics in place for several years. Let’s take a look at the two most prominent cloud platforms, Microsoft Azure and Amazon AWS, and see what they are about.
WIIFM: What’s In It For Me?
While advances in technology excite curiosity and imagination in the general public, business owners care about specific, immediate, and tangible business benefits and want them clearly spelled out. In other words, what concrete gains do data analytics platforms provide?
- Computing: scale-on-demand capability
- Scalable storage
- Site and network reliability and redundancy
- Security, security, security
- Instant access to reports
- High data-processing capability
- Predictive capabilities with artificial intelligence/machine learning
Microsoft Azure and Amazon Web Services both effectively manage data and turn it into valuable insights. So what do they do exactly, and what differentiates them?
The AWS Tech Stack
Giants such as Netflix and Airbnb use AWS to process enormous volumes of records and serve recommendations to users in real time. They choose AWS for its ability to rapidly handle and analyze accumulated data. Netflix and Airbnb use these insights to deliver content to users quickly, increasing engagement and satisfaction, and ultimately retention and revenue.
Without the right tooling, this process is held back by slow data pipelines, and data teams (even the most seasoned ones) usually end up stitching those pipelines together manually.
Here’s what AWS real-time stream processing looks like:
To maximize the value of AWS, it is essential to understand what each element of this chain does and how the pieces fit together:
- Kafka Streams. Named after one of the most intriguing and gifted writers of the 20th century, Kafka Streams is a client library for building applications and microservices whose input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka’s server-side cluster technology (a rough Python analogue is sketched after this list).
- S3/EMRFS Storage. The EMR File System (EMRFS) is an implementation of HDFS that all Amazon EMR clusters use to read and write regular files from Amazon EMR directly to Amazon S3. Amazon S3 provides persistent data storage with Hadoop integration; persistent storage is more convenient and enables essentials such as consistent view and data encryption (a boto3 sketch follows this list).
- Spark Streaming is an extension of the core Spark API that processes live data streams in a scalable, high-throughput, and fault-tolerant manner. Input can come from many sources and be processed with sophisticated algorithms, normally expressed through high-level operations such as map, reduce, join, and window. Spark processes the data streams and then delivers the results to databases, file systems, and live dashboards. Spark’s MLlib and GraphX can also be applied to these streams (see the PySpark sketch after this list).
- Apache Airflow is a workflow orchestration platform used to author, schedule, and monitor workflows programmatically. Its greatest strength is a simple and engaging interface that makes it easy to understand current production pipelines, track progress, and set up alerts about problems. Defining workflows as code makes them maintainable and testable, and keeps collaboration streamlined (a minimal DAG appears after this list).
- Apache Presto is a high-performance, distributed SQL query engine for big data. Unlike some other query tools, it lets users combine data from multiple sources, such as Kafka and Cassandra, within a single query (a query sketch follows this list).
- BI Tool generates ML-based business intelligence and distributes it across the business via interactive dashboards. These dashboards can be accessed from any device and embedded in apps and portals, so real-time insights and data are available swiftly and easily.
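Kafka Streams itself is a Java/Scala library, so the snippet below is only a rough Python stand-in for the same consume-transform-produce loop, written with the kafka-python package; the broker address and topic names are hypothetical.

```python
# A consume-transform-produce loop with kafka-python.
# This is a Python analogue of what a Kafka Streams topology does,
# not Kafka Streams itself. Broker and topic names are hypothetical.
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    event["processed"] = True          # placeholder for real enrichment logic
    producer.send("enriched-events", event)
```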
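On the storage side, here is a minimal boto3 sketch that writes and reads an S3 object; on an EMR cluster the same data would typically be addressed through an s3:// path via EMRFS. The bucket and key names are hypothetical.

```python
# Write and read an object in Amazon S3 with boto3.
# On EMR, Spark jobs would usually reference the same data via EMRFS,
# e.g. "s3://analytics-landing-zone/events/...". Names are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="analytics-landing-zone",
    Key="events/2024-01-01/events.json",
    Body=b'{"user_id": 1, "action": "view"}',
)

obj = s3.get_object(Bucket="analytics-landing-zone", Key="events/2024-01-01/events.json")
print(obj["Body"].read())
```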
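For the stream-processing step, a minimal PySpark Structured Streaming sketch might look like the following: it reads the Kafka topic produced above, counts events per one-minute window, and writes the results to S3. The topic, paths, and broker address are hypothetical, and the spark-sql-kafka connector must be available on the cluster.

```python
# Windowed aggregation over a Kafka stream with Spark Structured Streaming.
# Requires the spark-sql-kafka connector; broker, topic, and S3 paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("stream-counts").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "enriched-events")
    .load()
)

# Count messages per 1-minute window, keyed on the Kafka message key.
counts = (
    events.withWatermark("timestamp", "2 minutes")
    .groupBy(window(col("timestamp"), "1 minute"), col("key"))
    .count()
)

query = (
    counts.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3://analytics-landing-zone/stream-counts/")
    .option("checkpointLocation", "s3://analytics-landing-zone/checkpoints/stream-counts/")
    .start()
)
query.awaitTermination()
```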
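To orchestrate batch steps around the stream, an Airflow DAG can be as small as the sketch below; the task names and logic are hypothetical placeholders.

```python
# A minimal Airflow DAG: ingest, then transform, once a day.
# Task names and bodies are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():
    print("pulling raw events")        # stand-in for real ingestion code


def transform():
    print("building aggregates")       # stand-in for real transformation code


with DAG(
    dag_id="daily_events_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    ingest_task >> transform_task      # transform runs only after ingest succeeds
```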
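Finally, a federated query against Presto can be issued from Python through its DB-API client; the sketch below assumes the presto-python-client package, and the host, catalog, schema, and table names are hypothetical.

```python
# Run a query against a Presto cluster via presto-python-client (DB-API).
# Host, catalog, schema, and table names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="analytics",
)

cur = conn.cursor()
cur.execute(
    "SELECT action, count(*) AS events "
    "FROM page_events "
    "GROUP BY action "
    "ORDER BY events DESC "
    "LIMIT 10"
)
for row in cur.fetchall():
    print(row)
```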
The Azure Tech Stack
Giants such as Adobe, HP, and BMW use Microsoft Azure for its flexibility, reliability, scalability, and productivity. Azure Databricks comes with many perks, including Apache Spark optimization, support for multiple languages and libraries, autoscaling and auto-termination, integration with other Azure services, deep learning optimization, and a robust collaborative workspace. Azure customers get a solid data warehouse, real-time analytics, and advanced big data analytics.
“To create, deliver, and manage targeted content at scale can involve overwhelming volumes of data. Adobe and Microsoft share ideas about how to help our customers meet that challenge.”
—Tim Waddell, Director of Product Marketing, Adobe (formerly Director of Marketing Analytics, Microsoft)
Basically, what happens is:
- You bring data which falls into several categories: structured, semi-structured, and unstructured
- Azure Databricks processes the unstructured data and merges it with structured data
- Scalable ML and deep learning are applied to extract valuable insights
- Native connectors between Azure Synapse Analytics and Azure Databricks are used to access and move the data at scale (a sketch of this hand-off follows this list)
- Root cause determination and raw data analysis are performed with Databricks’ built-in capabilities
- Ad hoc queries are run directly on the data
- Insights derived in Databricks are pushed to Cosmos DB so they are accessible to both web and mobile apps
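As a minimal sketch of that Databricks-to-Synapse hand-off, the snippet below reads curated data from Data Lake Storage and writes it to a Synapse table through the Azure Synapse connector that ships with Azure Databricks; the storage account, container, JDBC URL, and table names are all hypothetical.

```python
# Inside a Databricks notebook, where `spark` is provided by the runtime:
# read curated data from Data Lake Storage Gen2 and push it to Azure Synapse
# via the built-in Synapse connector. All names below are hypothetical.
df = (
    spark.read.format("delta")
    .load("abfss://curated@mydatalake.dfs.core.windows.net/sales/")
)

(
    df.write.format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://myworkspace.sql.azuresynapse.net:1433;database=analytics")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.sales_curated")
    .option("tempDir", "abfss://staging@mydatalake.dfs.core.windows.net/tmp/")
    .mode("append")
    .save()
)
```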
Let’s look at the specific components of Azure.
- Data Factory is an extract-transform-load data integration service for creating data-driven workflows that orchestrate data movement and transform data at scale. It makes it easy to build and schedule pipelines that ingest data from various data stores, turning raw data into structured, organized datasets.
- Data Lake Storage (formerly known as Azure Data Lake Store) is a hyperscale, enterprise-wide repository for big data analytics workloads. It is secure and compliant, offers high performance, supports rich data storage, and is easy to set up (an upload sketch appears after this list).
- Databricks is an Apache Spark-based analytics platform. It is fast, collaborative, and easy to use.
- Data Warehouse (Azure Synapse Analytics) is a cloud data warehouse known for its flexibility, high speed, and reliability. Users can query data on their own terms. Its major features include limitless scale, powerful insights, a unified experience, and strong security (a query sketch appears after this list).
“Our adoption of the Azure Analytics platform has revolutionized our ability to deliver insights to the business. We are very excited that Azure Synapse Analytics will streamline our analytics processes even further with the seamless integration, the way all the pieces have come together so well.”
—Nallan Sriraman, Global Head of Technology, Unilever
- BI Tool (Power BI) is a comprehensive set of business intelligence tools for surfacing insights across an enterprise. Users can connect to many data sources at once, quickly prepare the necessary data, and run ad hoc analysis. The resulting reports are not only valuable but also genuinely good-looking (a refresh-trigger sketch appears after this list).
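To make the Data Lake Storage piece concrete, here is a minimal sketch that uploads a local file into a Data Lake Storage Gen2 container using the azure-storage-file-datalake SDK; the account URL, container, and path are hypothetical.

```python
# Upload a local CSV into Azure Data Lake Storage Gen2 with the Python SDK.
# Account URL, container (file system), and file path are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

file_system = service.get_file_system_client("raw")
file_client = file_system.get_file_client("pos/2024-01-01/pos.csv")

with open("pos.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)   # create or replace the file
```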
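Querying the Synapse warehouse from Python is ordinary T-SQL over ODBC; the sketch below assumes the Microsoft ODBC Driver for SQL Server is installed, and the server, database, credentials, and table names are hypothetical.

```python
# Query an Azure Synapse Analytics SQL pool over ODBC.
# Server, database, credentials, and table names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;"
    "Database=analytics;"
    "Uid=analyst;Pwd=<password>;"
    "Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute(
    "SELECT store_id, SUM(amount) AS revenue "
    "FROM dbo.sales_curated "
    "GROUP BY store_id "
    "ORDER BY revenue DESC"
)
for row in cursor.fetchall():
    print(row.store_id, row.revenue)
```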
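Power BI is mostly driven from its own interface, but dataset refreshes can also be triggered programmatically through the Power BI REST API; in the sketch below the workspace ID, dataset ID, and Azure AD access token are hypothetical placeholders.

```python
# Trigger a Power BI dataset refresh through the Power BI REST API.
# Workspace (group) ID, dataset ID, and the access token are placeholders;
# the token would normally be obtained from Azure AD (e.g., with msal).
import requests

ACCESS_TOKEN = "<azure-ad-access-token>"
GROUP_ID = "00000000-0000-0000-0000-000000000000"
DATASET_ID = "11111111-1111-1111-1111-111111111111"

url = (
    f"https://api.powerbi.com/v1.0/myorg/groups/{GROUP_ID}"
    f"/datasets/{DATASET_ID}/refreshes"
)
response = requests.post(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})
response.raise_for_status()   # a 202 Accepted status means the refresh was queued
print(response.status_code)
```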
Cooler Screens: A Case Study
To understand what Cooler Screens is, imagine a huge iPad installed as a cooler door. The screens are equipped with sensors; when they detect a person approaching, the merchandising switches on and the display shows the customer every product inside that particular cooler. The technology is “identity-blind,” meaning it does not collect personal or sensitive consumer information. Instead, Cooler Screens’ sensors register a shopper’s physical proximity along with weather conditions, time, day, and location.
“The content is like a chameleon. It adjusts to be contextually relevant for the individual environment and consumer target. We start in the cooler aisle by replacing the old glass cooler doors with new digital smart doors or smart screens with a merchandising platform built in. The platform seamlessly integrates into the existing retail environment, so that consumers can instantly and easily access the most relevant and up-to-date information just as if they were online.”
—Arsen Avakian, Co-Founder and CEO, Cooler Screens
Data Sources
Cooler Screens engaged Proxet to work on its Advertisement and Analytical Platforms. The platform ingests data from the following sources:
- POS data: 300-400 MB CSV file, every 30 minutes
- Impressions data: 2-3 GB CSV file, every day
- Cooler Screen Snapshot data: dictionary JSON file (~500 MB), from the Cooler Screen API on request
- Quividity data: Azure NoSQL DB (5-10 million rows), every day
- Broadsign Resources: dictionary CSV file (10 MB), on request
- SnECoefficients: dictionary, SQL Server DB
- Location, Advertisement Type, Stores: dictionaries, Cosmos DB, on request
Architecture Design
Achievements and Time Frames
- Defined requirements and created the product roadmap: 1 month
- Built the full infrastructure, with data migration from the data sources to Data Lake Storage: 2 weeks
- Created tables in the DWH, migrated data from the Data Lake to the DWH, and built stored procedures and views: 2 weeks
- Created 4 Power BI reports and tested all stages and reports: 2 weeks
Total project duration: 2.5 months
Performance
- Daily data transfer from all data sources: 2-3 minutes
- Running reports over a month of data with grouping and filtering: 2-4 seconds
Cost
- Azure DWH: $4300 per month
- Azure Data Factory, Data Lake, and Power BI: ~$200-$300 per month
- Development Team (System Analyst, Data Engineer, QA Tester): $30,000 per month
Team Structure
- 1 System Analyst
- 1 Data Engineer
- 1 QA Tester
And now for the actual impact of the technology. Brands that installed Cooler Screens in their stores saw a 50-100% increase in revenue. Shoppers respond to digital screens far more readily than to old-school glass doors; the displays are eye-catching even from well down the aisle. Smaller brands get a chance to put a relevant message in front of a targeted audience, and the shopping experience gains contextual relevance without compromising data privacy. Cooler Screens represents a new wave of customer experience with significant upside for participating businesses.