Big Data Analytics
&
Data Science
Theory and Implementation
Presented By :
Alfandy Sulaiman Barkah, S.T
DevOps & Data Service Lead
@ ATI Business Group
Agenda
01 Definition of Big Data
02 Characteristic of Big Data
03 Differences Between Big Data and Traditional Data
04 Definition of Big Data Analytics
05 Four Main Big Data Analysis Methods
06 Data Platform Journey
07 Operationalizing Big Data Analytics
08 Big Data Challenges
09 Definition of Data Science
10 Differences Between Big Data Analytics and Data Science
11 Characteristic of Data Science
12 Introduction to Big Data Tools
13 Big Data Ecosystem
14 Big Data Use Case Customer Profiling
15 Big Data Use Case Sentiment Analytics
01. Definition of Big Data
What is Big Data?
Big data refers to extremely large and complex data
sets that cannot be easily managed or analyzed
with traditional data processing tools, particularly
spreadsheets. Big data includes structured data,
like an inventory database or list of financial
transactions; unstructured data, such as social
posts or videos; and mixed data sets, like those
used to train large language models for AI.
02. Characteristic of Big Data
What are the Characteristics of Big Data?
Traditionally, we’ve recognized big data by three characteristics: variety, volume, and velocity, also known as the “three Vs.” However, two additional Vs have emerged over the past few years: value and veracity.

Volume
The sheer volume of data generated today, from social media feeds, IoT devices, transaction records and more, presents a significant challenge.

Velocity
Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest velocity of data streams directly into memory versus being written to disk.

Variety
Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data comes in new unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.

Veracity
Data reliability and accuracy are critical, as decisions based on inaccurate or incomplete data can lead to negative outcomes. Veracity refers to the data's trustworthiness, encompassing data quality, noise and anomaly detection issues.

Value
Big data analytics aims to extract actionable insights that offer tangible value. This involves turning vast data sets into meaningful information that can inform strategic decisions, uncover new opportunities and drive innovation. Advanced analytics, machine learning and AI are key to unlocking the value contained within big data, transforming raw data into strategic assets.
02. Characteristic of Big Data
What Types of Data are Processed in Big Data?
Structured Data
Structured data refers to highly organized information
that is easily searchable and typically stored in relational
databases or spreadsheets. It adheres to a rigid schema,
meaning each data element is clearly defined and
accessible in a fixed field within a record or file.
02. Characteristic of Big Data
What Types of Data are Processed in Big Data?
Unstructured Data
Unstructured data lacks a pre-defined data model, making
it more difficult to collect, process and analyze. It comprises
the majority of data generated today, and includes formats
such as:
● Textual content from documents, emails and social
media posts
● Multimedia content, including images, audio files and
videos
● Data from IoT devices, which can include a mix of
sensor data, log files and time-series data
02. Characteristic of Big Data
What Types of Data are Processed in Big Data?
Semi-structured Data
Semi-structured data occupies the middle ground
between structured and unstructured data. While it does
not reside in a relational database, it contains tags or other
markers to separate semantic elements and enforce
hierarchies of records and fields within the data
02. Characteristic of Big Data
Type of Data    | Schema? | Storage Method              | Examples                       | Processing Tools
Structured data | Yes     | SQL Databases               | Financial records, CRM         | MySQL, PostgreSQL
Semi-structured | Partial | NoSQL & Hybrid Systems      | JSON, Emails, IoT Sensor Logs  | MongoDB, Cassandra
Unstructured    | No      | Data Lakes & Cloud Storage  | Videos, Images, Social Media   | AI, NLP, Hadoop
03. Differences Between Big Data and Traditional Data
The key difference between big data analytics
and traditional data analytics lies in the type of
data processed and the tools used.
Traditional Data Analytics
Traditional data analytics focuses on structured data stored in relational databases, relying on statistical methods and tools like SQL for querying and analysis.

Big Data Analytics
Big data analytics handles vast amounts of structured, semi-structured, and unstructured data, requiring advanced techniques such as machine learning and data mining. It often employs distributed systems like Hadoop to manage large-scale processing.
04. Definition of Big Data Analytics
What is Big Data Analytics?
Big data analytics is the systematic processing of
large, complex data sets to uncover valuable
insights. It helps identify trends, patterns, and
correlations in raw data, enabling informed
decision-making. Organizations use big data
analytics to harness massive amounts of
information from sources like IoT sensors, social
media, financial transactions, and smart devices,
applying advanced techniques to generate
actionable intelligence.
05. Four Main Big Data Analysis Methods
These are the four methods of data analysis at work within big data:
05. Four Main Big Data Analysis Methods
Descriptive Analytics: Understanding "What Happened"
Descriptive analytics focuses on summarizing past data to identify
trends and patterns. It helps businesses understand historical
performance by analyzing key characteristics of their data.
Example Use Case:
A company reviewing sales data might discover a seasonal surge
in video game console purchases every October through
December. Descriptive analytics highlights this trend, offering
insights into consumer behavior.
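To make this concrete, here is a minimal PySpark sketch (with made-up sales rows) of the kind of aggregation descriptive analytics relies on: grouping historical purchases by month to surface the seasonal pattern.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("descriptive-analytics").getOrCreate()

# Hypothetical historical sales records (date, product, amount).
sales = spark.createDataFrame(
    [("2023-10-05", "console", 499.0),
     ("2023-11-20", "console", 459.0),
     ("2023-03-14", "console", 489.0)],
    ["purchase_date", "product", "amount"],
)

# Summarize "what happened": order count and revenue per month.
monthly = (sales
           .withColumn("month", F.month(F.to_date("purchase_date")))
           .groupBy("month")
           .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
           .orderBy("month"))
monthly.show()
```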
05. Four Main Big Data Analysis Methods
Diagnostic Analytics: Understanding "Why It Happened"
Diagnostic analytics focuses on identifying the root causes behind
observed trends and patterns in descriptive analytics. By analyzing
data in depth, organizations can uncover areas for improvement,
implement changes, and optimize business processes.
Example Use Case:
In digital marketing, a sudden drop in website traffic requires
diagnostic analysis to determine the underlying issue. By
examining user behavior, businesses can identify and resolve
problems to prevent further losses in engagement and conversions.
05. Four Main Big Data Analysis Methods
Predictive Analytics: Anticipating "What Will Happen"
Predictive analytics utilizes historical data, statistical modeling, and
machine learning to forecast future trends, helping organizations
make informed decisions.
Example Use Case:
In sales forecasting, businesses analyze past sales data, market
trends, and customer behavior to estimate future revenue and
demand. This enables companies to craft personalized marketing
campaigns, optimize product launches, and strategically position
themselves to maximize growth opportunities.
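A hedged sketch of this idea with Spark MLlib: fit a simple linear regression on historical demand and score the next period. The columns (month_index, marketing_spend, units_sold) and all rows are illustrative, not from the deck.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("predictive-analytics").getOrCreate()

# Hypothetical sales history used as training data.
history = spark.createDataFrame(
    [(1, 120.0, 1500.0), (2, 135.0, 1620.0), (3, 150.0, 1810.0), (4, 160.0, 1900.0)],
    ["month_index", "marketing_spend", "units_sold"],
)

assembler = VectorAssembler(inputCols=["month_index", "marketing_spend"],
                            outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="units_sold") \
    .fit(assembler.transform(history))

# Anticipate "what will happen" for month 5 with a planned spend of 170.
next_period = spark.createDataFrame([(5, 170.0)], ["month_index", "marketing_spend"])
model.transform(assembler.transform(next_period)).select("prediction").show()
```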
05. Four Main Big Data Analysis Methods
Prescriptive Analytics: Guiding "What to Do"
Prescriptive analytics goes beyond predicting future outcomes by
providing actionable recommendations for optimizing decisions
and strategies. It leverages insights from previous analytics to
suggest the best course of action.
Example Use Case:
Companies use prescriptive analytics to segment customers and
determine the best marketing approach for each group. It can
suggest optimal product recommendations, personalized
promotions, and ideal communication channels, ensuring
businesses connect with customers more effectively.
06. Data Platform Journey
What is a Data Warehouse?
A data warehouse is a centralized, structured repository
designed to store and manage structured data from various
sources for business intelligence (BI) and analytics.
Characteristics:
Optimized for SQL-based queries, analytics, and reporting.
Uses schema-on-write, meaning data is cleaned, transformed,
and structured before storage.
Provides high-performance querying and fast access to
well-organized data.
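A minimal schema-on-write sketch in PySpark: the schema is declared up front and rows must conform to it before they are loaded into the warehouse. The JDBC URL, table name, and credentials below are placeholders, and the PostgreSQL JDBC driver is assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

spark = SparkSession.builder.appName("warehouse-load").getOrCreate()

# Schema-on-write: structure is defined and enforced before storage.
schema = StructType([
    StructField("order_id", IntegerType(), nullable=False),
    StructField("customer", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=False),
])

orders = spark.createDataFrame([(1, "Alice", 120.5), (2, "Bob", 89.9)], schema)

# Load into a warehouse table over JDBC (placeholder endpoint and credentials).
orders.write.jdbc(
    url="jdbc:postgresql://warehouse-host:5432/dwh",
    table="fact_orders",
    mode="append",
    properties={"user": "etl", "password": "***", "driver": "org.postgresql.Driver"},
)
```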
06. Data Platform Journey
What is a Data Lake?
A data lake is a scalable storage system designed to hold raw,
unstructured, semi-structured, and structured data in its native
format, providing flexibility for advanced analytics, machine
learning, and exploratory analysis.
Characteristics:
Schema-on-Read: Stores data as-is, applying structure only
when querying or processing.
Supports All Data Types: Handles structured, semi-structured,
and unstructured data, such as JSON, images, logs, and videos.
ACID Transaction Support: not native to plain data lakes; it is added by open table formats such as Delta Lake or Apache Iceberg layered on top.
Scalable and Cost-Effective: Designed for large-scale,
cost-efficient storage on distributed systems
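By contrast with the warehouse sketch above, a schema-on-read sketch: raw JSON files sit in the lake untouched, and structure is only applied when they are read for analysis. The storage path and the event_type field are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-read").getOrCreate()

# Nothing was validated at write time; Spark infers the schema now, at read time.
events = spark.read.json("s3a://data-lake/raw/events/2024/*.json")
events.printSchema()
events.groupBy("event_type").count().show()
```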
06. Data Platform Journey
What is a Data Lakehouse?
A data lakehouse is a hybrid data management architecture that
combines the strengths of both data lakes and data warehouses
into a single system, providing the flexibility of data lakes with the
performance and reliability of data warehouses.
Characteristics:
Supports both schema-on-read and schema-on-write.
Handles structured, semi-structured, and unstructured data.
Provides ACID transactions and data governance for consistency
and reliability.
Optimized for both batch and real-time analytics, supporting BI
and advanced analytics in the same platform.
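A hedged lakehouse sketch using Delta Lake, one common open table format that adds ACID transactions on top of lake storage; it assumes the delta-spark package is installed, and the storage path is illustrative.

```python
from pyspark.sql import SparkSession

# Delta Lake extensions must be available in the Spark session (assumption).
spark = (SparkSession.builder.appName("lakehouse")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

orders = spark.createDataFrame([(1, "laptop", 950.0)], ["order_id", "product", "amount"])

# ACID-compliant append to a table stored on open lake storage (illustrative path).
orders.write.format("delta").mode("append").save("/lakehouse/tables/orders")

# The same table can then serve both BI queries and advanced analytics.
spark.read.format("delta").load("/lakehouse/tables/orders").show()
```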
07. Operationalizing Big Data Analytics
Why Aren't Big Data Tools Suitable for Small Datasets?
1. High Infrastructure Overhead
2. Complexity vs. Simplicity
3. Latency in Processing
4. Inefficiency in Resource Utilization
5. No Need for Distributed Storage
08. Big Data Challenges
Big Data Challenges:
1. The Overwhelming Volume of Data.
2. The Complexity of Data Variety.
3. The Need for Real-Time Processing.
4. The Challenge of Data Quality &
Accuracy.
5. Security & Privacy Risks in Big Data.
6. The Struggle with Query Performance
Optimization.
7. The Problem of Hotspotting & Workload
Skew.
8. The Challenge of Fault Tolerance &
System Reliability.
9. The Complexity of Machine Learning &
AI Integration.
10. The Growing Skill Gap in Big Data.
09. Definition of Data Science
What’s Data Science?
The U.S. Census Bureau defines data science as
"a field of study that uses scientific methods,
processes, and systems to extract knowledge
and insights from data.”
Data science combines math and statistics,
specialized programming, advanced analytics,
artificial intelligence (AI) and machine learning
with specific subject matter expertise to
uncover actionable insights hidden in an
organization’s data.
10. Differences Between Big Data Analytics and Data Science
Aspect        | Big Data                                               | Data Science
Definition    | Handling and processing vast amounts of data           | Extracting insights and knowledge from data
Objective     | Efficient storage, processing, and management of data  | Analyzing data to inform decisions and predict trends
Focus         | Volume, velocity, and variety of data                  | Analytical methods, models, and algorithms
Primary Tasks | Collection, storage, and processing of data            | Data analysis, modeling, and interpretation
10. Differences Between Big Data Analytics and Data Science
Aspect             | Big Data                                            | Data Science
Tools/Technologies | Hadoop, Spark, NoSQL databases (e.g., MongoDB)      | Python, R, TensorFlow, Scikit-Learn
Data Types         | Structured, semi-structured, and unstructured data  | Processed and cleaned data for analysis
Outcome            | Accessible data repositories for analysis           | Actionable insights, predictive models
Skill Set          | Data engineering, distributed computing             | Statistical analysis, machine learning, programming
10. Differences Between Big Data Analytics and Data Science
Aspect         | Big Data                                              | Data Science
Typical Roles  | Data Engineers, Big Data Analysts                     | Data Scientists, Machine Learning Engineers
Applications   | Real-time data processing, large-scale data storage   | Predictive analytics, data-driven decision making
Key Techniques | Distributed computing, data warehousing               | Statistical modeling, machine learning algorithms
11. Characteristic of Data Science
The data science life cycle encompasses five key steps that data must go through in order to provide valuable insights.

Data Collection
The first step in the data science life cycle is obtaining the data. Data scientists can collect data from a variety of sources, including databases, sensors, APIs (application programming interfaces), and online platforms.

Data Cleaning
After the data is collected, data scientists clean and preprocess it. This step, often referred to as data cleaning or wrangling, requires data scientists to format the data for analysis and deal with missing values, duplicates, and other errors.

Data Analysis
At this stage, the data is ready, so data scientists begin the so-called exploratory data analysis (EDA) process. The aim is to understand the data's underlying structures and main characteristics and identify patterns.

Data Modelling
Data scientists use machine learning algorithms or different statistical techniques in order to predict outcomes or explain relationships within the data. Depending on the problem, these models can be predictive, such as forecasting future sales, or descriptive, such as clustering customers by behavior.

Data Interpretation
At this stage, data scientists should have produced results by the end of the cycle, and all that is left to do is interpret the conclusions and communicate them to the rest of the team.
12. Introduction to Big Data Tools
What’s Hadoop?
Hadoop
Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.

Hadoop has two main components:
HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows for the storage of large amounts of data across multiple machines. It is designed to work with commodity hardware, which makes it cost-effective.
YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS.
12. Introduction to Big Data Tools
What Does the Hadoop Ecosystem Look Like?
13. Big Data Ecosystem
What’s HDFS?
The Apache Hadoop HDFS is the distributed file
system of Hadoop that is designed to store large
files on cheap hardware. It is highly fault-tolerant
and provides high throughput to applications.
HDFS is best suited for applications that have very large data sets.
HDFS follows a master-slave architecture: the master node runs the NameNode daemon, while the slave (worker) nodes run DataNode daemons.
13. Big Data Ecosystem
What’s MapReduce?
MapReduce is the data processing layer of Hadoop. It splits a job into small tasks, assigns those tasks to many machines joined over a network, and assembles the intermediate results into the final output dataset. The basic unit of data required by MapReduce is the key-value pair: all data, whether structured or not, needs to be translated into key-value pairs before it is passed through the MapReduce model. In the MapReduce framework, the processing unit is moved to the data rather than moving the data to the processing unit.
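The classic illustration of this model is word count. The sketch below uses PySpark RDDs rather than the native Java MapReduce API, but the map and reduce phases over key-value pairs are the same idea.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["big data needs big tools", "data drives decisions"])

counts = (lines
          .flatMap(lambda line: line.split())   # split lines into words
          .map(lambda word: (word, 1))          # map phase: emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))     # reduce phase: sum counts per key

print(counts.collect())
```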
13. Big Data Ecosystem
What’s YARN?
YARN stands for “Yet Another Resource Negotiator” and is the resource management layer of the Hadoop cluster. YARN implements resource management and job scheduling in the Hadoop cluster. The primary idea of YARN is to split job scheduling and resource management into separate processes, making the overall operation more efficient.
YARN provides two daemons: the first is called the Resource Manager and the second is called the Node Manager. Both components are used to process data computation in YARN. The Resource Manager runs on the master node of the Hadoop cluster and negotiates resources among all applications, whereas the Node Manager is hosted on every slave (worker) node. The Node Manager is responsible for monitoring the containers and their resource usage (CPU, memory, disk, and network) and reporting these details to the Resource Manager.
13. Big Data Ecosystem
What’s Zookeeper?
Apache Zookeeper acts as a coordinator between
different services of Hadoop and is used for
maintaining configuration information, naming,
providing distributed synchronization, and
providing group services. ZooKeeper also helps applications that are newly deployed in a distributed environment avoid problems such as race conditions and deadlocks.
13. Big Data Ecosystem
What’s Hive?
Apache Hive is the data-warehousing project of Hadoop. Hive is intended to facilitate easy data summarization, ad-hoc querying, and analysis of large volumes of data. With the help of HiveQL, a user can run ad-hoc queries on datasets stored in HDFS and use the results for further analysis. Hive also supports custom user-defined functions that users can apply for custom analysis.
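A hedged sketch of an ad-hoc HiveQL-style query, run here through Spark's Hive support to stay consistent with the rest of the deck's tooling; a reachable Hive metastore and an existing table named sales are both assumptions.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-adhoc")
         .enableHiveSupport()   # requires a configured Hive metastore (assumption)
         .getOrCreate())

# Ad-hoc summarization over a hypothetical Hive table named `sales`.
spark.sql("""
    SELECT product, SUM(amount) AS revenue
    FROM sales
    GROUP BY product
    ORDER BY revenue DESC
""").show()
```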
13. Big Data Ecosystem
What’s HBase?
Apache HBase is a distributed, open-source, versioned, non-relational database modeled after Google's Bigtable. It is an important component of the Hadoop ecosystem that leverages the fault tolerance of HDFS and provides real-time read and write access to data. HBase is better described as a data storage system than a full database, because it does not provide RDBMS features such as triggers, a rich query language, or secondary indexes.
13. Big Data Ecosystem
What’s Spark?
Apache Spark is a general-purpose, fast cluster computing system and a very powerful tool for big data. Spark provides a rich set of APIs in multiple languages such as Python, Scala, Java, and R. Spark also ships with high-level libraries, including Spark SQL, GraphX, MLlib, Spark Streaming, and SparkR, each supporting a different kind of workload; the use cases later in this deck build on them.
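A minimal taste of the DataFrame and Spark SQL APIs mentioned above, with made-up rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-intro").getOrCreate()

people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 29), ("Cira", 41)], ["name", "age"])

# DataFrame API
people.filter(people.age > 30).show()

# The same query through Spark SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```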
14 Big Data Use Case Customer Profiling
Customer Profiling &
Recommendation System Simulation
In a rapidly evolving market, businesses must
understand their customers to tailor experiences
that increase satisfaction, engagement, and
revenue. This simulation delves into a data-driven
approach to customer profiling, churn prediction,
and personalized product recommendations using
Apache Spark.
14 Big Data Use Case Profiling
1. Customer Data Overview
This DataFrame contains basic customer attributes, such as:
● customer_id: A unique identifier for each customer.
● age: The customer’s age.
● gender: The customer’s gender (Male/Female).
● location: The city where the customer is based.
● amount_spent: How much the customer spent on their last purchase.
● purchase_date: The specific day of purchase.
● customer_name: Assigned name for easier interpretation.
14 Big Data Use Case Profiling
This dataset lays the foundation for customer behavior analysis.
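The deck does not show the loading code, but a sketch of how such a customer DataFrame could be built in PySpark looks like this (the ages, locations, and dates below are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer-profiling").getOrCreate()

customers = spark.createDataFrame(
    [(7, 31, "Female", "Jakarta", 285.0, "2024-06-12", "Hannah"),
     (9, 27, "Male", "Bandung", 123.0, "2024-06-15", "Jack")],
    ["customer_id", "age", "gender", "location",
     "amount_spent", "purchase_date", "customer_name"],
)
customers.show()
```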
14 Big Data Use Case Profiling
2. Product Catalog Overview
This DataFrame contains product details, including:
● product_name: The name of the product.
● product_id: A unique identifier for each product.
Example Products:
● Smartphone X (product_id 91900)
● Gaming Console (product_id 394900)
● VR Headset (product_id 544400)
14 Big Data Use Case Profiling
The product catalog enables linking purchases to specific items.
14 Big Data Use Case Profiling
3. Transaction Data Overview
This DataFrame shows which customers purchased which products, including:
● product_id: ID of the product bought
● customer_id: ID of the buyer.
● product_name: The actual product name.
● amount_spent: How much was paid for the product.
Example Transactions
● Hannah (customer_id 7) bought a Drone 4K (product_id 79300) for $285
● Jack (customer_id 9) purchased a Bluetooth Speaker (product_id 746800) for $123.
14 Big Data Use Case Profiling
This confirms successful product-to-customer mapping.
14 Big Data Use Case Profiling
4. Purchase Behavior Overview
This DataFrame tracks customer spending patterns, showing:
● purchase_count: The number of transactions a customer made.
● avg_spent: The average amount spent per purchase.
Example Insights:
● Rachel made 1 purchase, spending an average of $87.
● Bob spent an average of $92 per transaction.
14 Big Data Use Case Profiling
This helps identify high-value vs. occasional buyers.
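A sketch of how purchase_count and avg_spent could be derived from the transaction data; the exact code is not shown in the deck, and the sample rows are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("purchase-behavior").getOrCreate()

transactions = spark.createDataFrame(
    [(7, 79300, "Drone 4K", 285.0),
     (9, 746800, "Bluetooth Speaker", 123.0),
     (9, 91900, "Smartphone X", 61.0)],
    ["customer_id", "product_id", "product_name", "amount_spent"],
)

# Per-customer behavior: number of transactions and average spend.
purchase_behavior = (transactions
                     .groupBy("customer_id")
                     .agg(F.count("*").alias("purchase_count"),
                          F.avg("amount_spent").alias("avg_spent")))
purchase_behavior.show()
```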
14 Big Data Use Case Profiling
5. Customer Segmentation Overview
This DataFrame clusters customers based on purchase frequency and spending using K-Means
clustering:
● purchase_count: The number of purchases per customer.
● avg_spent: Average spending per transaction.
● features: Encoded vector [purchase_count, avg_spent] used for clustering.
● prediction: The assigned cluster (0, 1, or 2).
Example Insights:
● Hannah (customer_id 2707) is grouped in cluster 2, spending $285.
● Karen (customer_id 5650) belongs to cluster 0, having spent a high amount of $487.
14 Big Data Use Case Profiling
Businesses can now target different customer types separately.
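A hedged sketch of this segmentation step with Spark MLlib: the two behavioral columns are assembled into a feature vector and clustered with K-Means (k=3 to match the clusters 0-2 above; the rows are illustrative).

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("segmentation").getOrCreate()

behavior = spark.createDataFrame(
    [(2707, 1, 285.0), (5650, 3, 487.0), (8621, 1, 92.0), (3146, 2, 87.0)],
    ["customer_id", "purchase_count", "avg_spent"],
)

# Encode [purchase_count, avg_spent] as the `features` vector used for clustering.
features = VectorAssembler(inputCols=["purchase_count", "avg_spent"],
                           outputCol="features").transform(behavior)

model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
segments = model.transform(features)   # adds the `prediction` (cluster) column
segments.select("customer_id", "purchase_count", "avg_spent", "prediction").show()
```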
14 Big Data Use Case Profiling
6. Churn Prediction Overview
This DataFrame predicts which customers are at risk of leaving:
● churn: 1 means likely to churn, 0 means active
● prediction: The AI’s decision (1 = churn risk, 0 = active)
Example Churn Risks:
● Bob (customer_id 8621) has a churn prediction of 1.0, meaning he’s likely to stop purchasing.
● Grace (customer_id 3146) is also flagged as a churn risk.
14 Big Data Use Case Profiling
Business Impact: Companies can offer promotions or loyalty rewards to retain at-risk customers.
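The deck does not name the churn model, so the sketch below uses logistic regression on the same behavioral features as one plausible choice; the rows and labels are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn").getOrCreate()

# Hypothetical labeled data: churn = 1 means the customer stopped purchasing.
labeled = spark.createDataFrame(
    [(8621, 1, 92.0, 1.0), (3146, 1, 87.0, 1.0),
     (5650, 3, 487.0, 0.0), (2707, 2, 285.0, 0.0)],
    ["customer_id", "purchase_count", "avg_spent", "churn"],
)

features = VectorAssembler(inputCols=["purchase_count", "avg_spent"],
                           outputCol="features").transform(labeled)

model = LogisticRegression(featuresCol="features", labelCol="churn").fit(features)
scored = model.transform(features)     # adds `prediction` and `probability` columns
scored.select("customer_id", "churn", "prediction").show()
```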
14 Big Data Use Case Profiling
7. Recommendation Results Overview
This DataFrame provides AI-driven product suggestions based on customer purchasing patterns:
● customer_name: The buyer’s name.
● product_name: The recommended product.
● product_id: The suggested item’s ID.
● rating: The confidence score (higher = stronger recommendation).
Example Recommendations:
● Grace is recommended to buy "Smartphone X" with a high rating of 349.97.
● Nancy is strongly encouraged to purchase "Smartphone X" with a confidence score of 73.95.
14 Big Data Use Case Profiling
These recommendations help businesses suggest relevant products to customers, improving sales and
engagement.
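The deck does not show the recommender's code; a common Spark choice is ALS collaborative filtering, sketched below with amount_spent used as an implicit preference signal (that choice, and the sample rows, are assumptions).

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommendations").getOrCreate()

# Hypothetical customer-product interactions.
interactions = spark.createDataFrame(
    [(7, 79300, 285.0), (9, 746800, 123.0), (3, 91900, 350.0), (14, 91900, 74.0)],
    ["customer_id", "product_id", "amount_spent"],
)

als = ALS(userCol="customer_id", itemCol="product_id", ratingCol="amount_spent",
          implicitPrefs=True, coldStartStrategy="drop", seed=42)
model = als.fit(interactions)

# For each customer, an array of (product_id, rating) suggestions; `rating` is
# the model's confidence score, like the scores shown above.
model.recommendForAllUsers(2).show(truncate=False)
```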
14 Big Data Use Case Profiling
Final Summary
This simulation successfully:
● Mapped customer purchases to actual products
● Segmented customers into spending groups
● Predicted churn risks for customer retention
● Generated personalized product recommendations
15 Big Data Use Case Sentiment Analytics
Understanding Sentiment Analytics in Spark: A Narrative Approach
Imagine a world where businesses can instantly understand customer emotions, predicting how people
feel about a product, a service, or an experience. In today’s digital landscape, millions of customer
reviews, tweets, and social media comments are generated daily—each containing valuable insights into
public sentiment.
But how do businesses make sense of this massive textual data? Enter Sentiment Analysis in Apache
Spark, a powerful technique that harnesses Natural Language Processing (NLP) to classify emotions at
scale.
15 Big Data Use Case Sentiment Analytics
Running Sentiment Analytics in Spark
This simulation takes 100 text samples—a mix of product reviews, customer feedback, and user
opinions—and assigns sentiment labels based on emotional tone.
We leverage TextBlob, a simple yet effective NLP library, to classify text into three emotions:
Positive: Expresses joy, approval, or satisfaction.
Negative: Shows disappointment, frustration, or criticism.
Neutral: Neither strongly positive nor negative.
Each sentence undergoes sentiment scoring, calculated using polarity values from -1 to +1.
Spark processes thousands of reviews in parallel, enabling fast and scalable analysis.
15 Big Data Use Case Sentiment Analytics
Behind the Scenes: How Spark Classifies Sentiment
1. Loading text data: We ingest customer comments into a Spark DataFrame.
2. Applying NLP: Spark uses TextBlob to compute sentiment polarity for each sentence.
3. Classification: If the polarity score is:
○ Greater than 0, it's labeled Positive.
○ Less than 0, it's Negative.
○ Equal to 0, it's Neutral.
4. Final Output: A Spark DataFrame is generated, mapping each review to its sentiment label.
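A minimal sketch of those four steps: a TextBlob polarity UDF applied to a Spark DataFrame of reviews, with the polarity thresholded into the three labels. The sample sentences are illustrative, and TextBlob must be installed on the Spark workers.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
from textblob import TextBlob

spark = SparkSession.builder.appName("sentiment").getOrCreate()

# 1. Load text data into a Spark DataFrame.
reviews = spark.createDataFrame(
    [("I love this product",),
     ("Terrible customer service",),
     ("It arrived on Tuesday",)],
    ["text"],
)

# 2. Apply NLP: compute TextBlob's polarity score (-1 to +1) for each sentence.
@F.udf(returnType=DoubleType())
def polarity(text):
    return float(TextBlob(text).sentiment.polarity)

# 3. Classify: >0 Positive, <0 Negative, =0 Neutral.  4. Final labeled DataFrame.
labeled = (reviews
           .withColumn("polarity", polarity("text"))
           .withColumn("sentiment",
                       F.when(F.col("polarity") > 0, "Positive")
                        .when(F.col("polarity") < 0, "Negative")
                        .otherwise("Neutral")))
labeled.show(truncate=False)
```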
15 Big Data Use Case Sentiment Analytics
Business Impact: Why Sentiment Analysis Matters
With this pipeline in place, businesses can:
✅ Track customer happiness in real time
✅ Predict brand reputation trends
✅ Identify product strengths & weaknesses
✅ Respond quickly to negative feedback before it escalates
Imagine an e-commerce brand launching a new product. Spark instantly scans thousands of
social media posts, detecting negative reviews before they go viral. This allows the company to
mitigate damage by addressing customer concerns proactively.
REFERENCES (SANDBOX PLAYGROUND)
1. Installation:
a. Install VirtualBox
b. Download the Cloudera QuickStart VM:
https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.13.0-0-virtualbox.zip
c. Attach the Cloudera image to VirtualBox
d. Start the image in VirtualBox
2. Configuration:
a. After starting the Cloudera VM, the Cloudera QuickStart page will appear in the browser.
b. Check the hostname: hostname
c. hdfs dfs -ls /
d. service cloudera-scm-server status
e. su - (password: cloudera)
f. sudo /home/cloudera/cloudera-manager --force --express
g. Wait a few minutes, then done.
REFERENCES
● https://seas.harvard.edu/news/what-data-science-definition-skills-applications-more
● https://www.ibm.com/think/topics/big-data-analytics
● https://www.geeksforgeeks.org/data-science/
● https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
● https://www.dasca.org/world-of-data-science/article/big-data-processing-transforming-data-into-actionable-insights
● https://www.upgrad.com/blog/major-challenges-of-big-data/
● https://www.datamation.com/big-data/big-data-challenges/
● https://www.oracle.com/id/big-data/what-is-big-data/
Thank You!
Have questions or want to stay connected?
Feel free to reach out:
[email protected]
linkedin.com/in/alfandy-sulaiman-barkah/
I'm open to discussions, collaborations, or
any feedback you may have.