Big Data Analytics
&
Data Science
Theory and Implementation
Presented By :
Alfandy Sulaiman Barkah, S.T
DevOps & Data Service Lead
@ ATI Business Group
Agenda
01 Definition of Big Data
02 Characteristic of Big Data
03 Differences Between Big Data and Traditional Data
04 Definition of Big Data Analytics
05 Four Main Big Data Analysis Methods
06 Data Platform Journey
07 Operationalizing Big Data Analytics
08 Big Data Challenges
09 Definition of Data Science
10 Differences Between Big Data Analytics and Data Science
11 Characteristic of Data Science
12 Introduction to Big Data Tools
13 Big Data Ecosystem
14 Big Data Use Case Customer Profiling
15 Big Data Use Case Sentiment Analytics
01. Definition of Big Data
What is Big Data?
Big data refers to extremely large and complex data
sets that cannot be easily managed or analyzed
with traditional data processing tools, particularly
spreadsheets. Big data includes structured data,
like an inventory database or list of financial
transactions; unstructured data, such as social
posts or videos; and mixed data sets, like those
used to train large language models for AI.
02. Characteristic of Big Data
What are the Characteristics of Big Data?
Traditionally, we’ve recognized big data by three characteristics: variety, volume, and velocity, also known as the “three Vs.” However, two additional Vs have emerged over the past few years: value and veracity.

Volume
The sheer volume of data generated today, from social media feeds, IoT devices, transaction records and more, presents a significant challenge.

Velocity
Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the highest velocity of data streams directly into memory versus being written to disk.

Variety
Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data comes in new unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata.

Veracity
Data reliability and accuracy are critical, as decisions based on inaccurate or incomplete data can lead to negative outcomes. Veracity refers to the data's trustworthiness, encompassing data quality, noise and anomaly detection issues.

Value
Big data analytics aims to extract actionable insights that offer tangible value. This involves turning vast data sets into meaningful information that can inform strategic decisions, uncover new opportunities and drive innovation. Advanced analytics, machine learning and AI are key to unlocking the value contained within big data, transforming raw data into strategic assets.
02. Characteristic of Big Data
What Types of Data are Processed in Big Data?
Structured Data
Structured data refers to highly organized information
that is easily searchable and typically stored in relational
databases or spreadsheets. It adheres to a rigid schema,
meaning each data element is clearly defined and
accessible in a fixed field within a record or file.
02. Characteristic of Big Data
What Types of Data are Processed in Big Data?
Unstructured Data
Unstructured data lacks a pre-defined data model, making
it more difficult to collect, process and analyze. It comprises
the majority of data generated today, and includes formats
such as:
● Textual content from documents, emails and social
media posts
● Multimedia content, including images, audio files and
videos
● Data from IoT devices, which can include a mix of
sensor data, log files and time-series data
02. Characteristic of Big Data
What Types of Data are Processed in Big Data?
Semi-structured Data
Semi-structured data occupies the middle ground
between structured and unstructured data. While it does
not reside in a relational database, it contains tags or other
markers to separate semantic elements and enforce
hierarchies of records and fields within the data
02. Characteristic of Big Data
Type of Data    | Schema? | Storage Method              | Examples                       | Processing Tools
Structured data | Yes     | SQL Databases               | Financial records, CRM         | MySQL, PostgreSQL
Semi-structured | Partial | NoSQL & Hybrid Systems      | JSON, Emails, IoT Sensor Logs  | MongoDB, Cassandra
Unstructured    | No      | Data Lakes & Cloud Storage  | Videos, Images, Social Media   | AI, NLP, Hadoop
03. Differences Between Big Data and Traditional Data
The key difference between big data analytics
and traditional data analytics lies in the type of
data processed and the tools used.
Traditional Data Analytics
Traditional data analytics focuses on structured data stored in relational databases, relying on statistical methods and tools like SQL for querying and analysis.

Big Data Analytics
Big data analytics handles vast amounts of structured, semi-structured, and unstructured data, requiring advanced techniques such as machine learning and data mining. It often employs distributed systems like Hadoop to manage large-scale processing.
04. Definition of Big Data Analytics
What is Big Data Analytics?
Big data analytics is the systematic processing of
large, complex data sets to uncover valuable
insights. It helps identify trends, patterns, and
correlations in raw data, enabling informed
decision-making. Organizations use big data
analytics to harness massive amounts of
information from sources like IoT sensors, social
media, financial transactions, and smart devices,
applying advanced techniques to generate
actionable intelligence.
05. Four Main Big Data Analysis Methods
These are the four methods of data analysis at work within big data:
05. Four Main Big Data Analysis Methods
Descriptive Analytics: Understanding "What Happened"
Descriptive analytics focuses on summarizing past data to identify
trends and patterns. It helps businesses understand historical
performance by analyzing key characteristics of their data.
Example Use Case:
A company reviewing sales data might discover a seasonal surge
in video game console purchases every October through
December. Descriptive analytics highlights this trend, offering
insights into consumer behavior.
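To make this concrete, here is a minimal PySpark sketch (with made-up sales rows) of the kind of aggregation descriptive analytics relies on: grouping historical purchases by month to surface the seasonal pattern.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("descriptive-analytics").getOrCreate()

# Hypothetical historical sales records (date, product, amount).
sales = spark.createDataFrame(
    [("2023-10-05", "console", 499.0),
     ("2023-11-20", "console", 459.0),
     ("2023-03-14", "console", 489.0)],
    ["purchase_date", "product", "amount"],
)

# Summarize "what happened": order count and revenue per month.
monthly = (sales
           .withColumn("month", F.month(F.to_date("purchase_date")))
           .groupBy("month")
           .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
           .orderBy("month"))
monthly.show()
```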
05. Four Main Big Data Analysis Methods
Diagnostic Analytics: Understanding "Why It Happened"
Diagnostic analytics focuses on identifying the root causes behind
observed trends and patterns in descriptive analytics. By analyzing
data in depth, organizations can uncover areas for improvement,
implement changes, and optimize business processes.
Example Use Case:
In digital marketing, a sudden drop in website traffic requires
diagnostic analysis to determine the underlying issue. By
examining user behavior, businesses can identify and resolve
problems to prevent further losses in engagement and conversions.
05. Four Main Big Data Analysis Methods
Predictive Analytics: Anticipating "What Will Happen"
Predictive analytics utilizes historical data, statistical modeling, and
machine learning to forecast future trends, helping organizations
make informed decisions.
Example Use Case:
In sales forecasting, businesses analyze past sales data, market
trends, and customer behavior to estimate future revenue and
demand. This enables companies to craft personalized marketing
campaigns, optimize product launches, and strategically position
themselves to maximize growth opportunities.
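A hedged sketch of this idea with Spark MLlib: fit a simple linear regression on historical demand and score the next period. The columns (month_index, marketing_spend, units_sold) and all rows are illustrative, not from the deck.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("predictive-analytics").getOrCreate()

# Hypothetical sales history used as training data.
history = spark.createDataFrame(
    [(1, 120.0, 1500.0), (2, 135.0, 1620.0), (3, 150.0, 1810.0), (4, 160.0, 1900.0)],
    ["month_index", "marketing_spend", "units_sold"],
)

assembler = VectorAssembler(inputCols=["month_index", "marketing_spend"],
                            outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="units_sold") \
    .fit(assembler.transform(history))

# Anticipate "what will happen" for month 5 with a planned spend of 170.
next_period = spark.createDataFrame([(5, 170.0)], ["month_index", "marketing_spend"])
model.transform(assembler.transform(next_period)).select("prediction").show()
```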
05. Four Main Big Data Analysis Methods
Prescriptive Analytics: Guiding "What to Do"
Prescriptive analytics goes beyond predicting future outcomes by
providing actionable recommendations for optimizing decisions
and strategies. It leverages insights from previous analytics to
suggest the best course of action.
Example Use Case:
Companies use prescriptive analytics to segment customers and
determine the best marketing approach for each group. It can
suggest optimal product recommendations, personalized
promotions, and ideal communication channels, ensuring
businesses connect with customers more effectively.
06. Data Platform Journey
What is a Data Warehouse?
A data warehouse is a centralized, structured repository
designed to store and manage structured data from various
sources for business intelligence (BI) and analytics.
Characteristics:
Optimized for SQL-based queries, analytics, and reporting.
Uses schema-on-write, meaning data is cleaned, transformed,
and structured before storage.
Provides high-performance querying and fast access to
well-organized data.
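A minimal schema-on-write sketch in PySpark: the schema is declared up front and rows must conform to it before they are loaded into the warehouse. The JDBC URL, table name, and credentials below are placeholders, and the PostgreSQL JDBC driver is assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

spark = SparkSession.builder.appName("warehouse-load").getOrCreate()

# Schema-on-write: structure is defined and enforced before storage.
schema = StructType([
    StructField("order_id", IntegerType(), nullable=False),
    StructField("customer", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=False),
])

orders = spark.createDataFrame([(1, "Alice", 120.5), (2, "Bob", 89.9)], schema)

# Load into a warehouse table over JDBC (placeholder endpoint and credentials).
orders.write.jdbc(
    url="jdbc:postgresql://warehouse-host:5432/dwh",
    table="fact_orders",
    mode="append",
    properties={"user": "etl", "password": "***", "driver": "org.postgresql.Driver"},
)
```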
06. Data Platform Journey
What is a Data Lake?
A data lake is a scalable storage system designed to hold raw,
unstructured, semi-structured, and structured data in its native
format, providing flexibility for advanced analytics, machine
learning, and exploratory analysis.
Characteristics:
Schema-on-Read: Stores data as-is, applying structure only
when querying or processing.
Supports All Data Types: Handles structured, semi-structured,
and unstructured data, such as JSON, images, logs, and videos.
ACID Transaction Support: not native to plain data lakes; it is added by open table formats such as Delta Lake or Apache Iceberg layered on top.
Scalable and Cost-Effective: Designed for large-scale,
cost-efficient storage on distributed systems
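By contrast with the warehouse sketch above, a schema-on-read sketch: raw JSON files sit in the lake untouched, and structure is only applied when they are read for analysis. The storage path and the event_type field are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-read").getOrCreate()

# Nothing was validated at write time; Spark infers the schema now, at read time.
events = spark.read.json("s3a://data-lake/raw/events/2024/*.json")
events.printSchema()
events.groupBy("event_type").count().show()
```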
06. Data Platform Journey
What is a Data Lakehouse?
A data lakehouse is a hybrid data management architecture that
combines the strengths of both data lakes and data warehouses
into a single system, providing the flexibility of data lakes with the
performance and reliability of data warehouses.
Characteristics:
Supports both schema-on-read and schema-on-write.
Handles structured, semi-structured, and unstructured data.
Provides ACID transactions and data governance for consistency
and reliability.
Optimized for both batch and real-time analytics, supporting BI
and advanced analytics in the same platform.
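A hedged lakehouse sketch using Delta Lake, one common open table format that adds ACID transactions on top of lake storage; it assumes the delta-spark package is installed, and the storage path is illustrative.

```python
from pyspark.sql import SparkSession

# Delta Lake extensions must be available in the Spark session (assumption).
spark = (SparkSession.builder.appName("lakehouse")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

orders = spark.createDataFrame([(1, "laptop", 950.0)], ["order_id", "product", "amount"])

# ACID-compliant append to a table stored on open lake storage (illustrative path).
orders.write.format("delta").mode("append").save("/lakehouse/tables/orders")

# The same table can then serve both BI queries and advanced analytics.
spark.read.format("delta").load("/lakehouse/tables/orders").show()
```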
07. Operationalizing Big Data Analytics
Why Aren't Big Data Tools Suitable for Small Datasets?
1. High Infrastructure Overhead
2. Complexity vs. Simplicity
3. Latency in Processing
4. Inefficiency in Resource Utilization
5. No Need for Distributed Storage
08. Big Data Challenges
Big Data Challenges:
1. The Overwhelming Volume of Data.
2. The Complexity of Data Variety.
3. The Need for Real-Time Processing.
4. The Challenge of Data Quality &
Accuracy.
5. Security & Privacy Risks in Big Data.
6. The Struggle with Query Performance
Optimization.
7. The Problem of Hotspotting & Workload
Skew.
8. The Challenge of Fault Tolerance &
System Reliability.
9. The Complexity of Machine Learning &
AI Integration.
10. The Growing Skill Gap in Big Data.
09. Definition of Data Science
What’s Data Science?
The U.S. Census Bureau defines data science as
"a field of study that uses scientific methods,
processes, and systems to extract knowledge
and insights from data.”
Data science combines math and statistics,
specialized programming, advanced analytics,
artificial intelligence (AI) and machine learning
with specific subject matter expertise to
uncover actionable insights hidden in an
organization’s data.
10. Differences Between Big Data Analytics and Data Science
Aspect        | Big Data                                               | Data Science
Definition    | Handling and processing vast amounts of data           | Extracting insights and knowledge from data
Objective     | Efficient storage, processing, and management of data  | Analyzing data to inform decisions and predict trends
Focus         | Volume, velocity, and variety of data                  | Analytical methods, models, and algorithms
Primary Tasks | Collection, storage, and processing of data            | Data analysis, modeling, and interpretation
10. Differences Between Big Data Analytics and Data Science
Aspect             | Big Data                                            | Data Science
Tools/Technologies | Hadoop, Spark, NoSQL databases (e.g., MongoDB)      | Python, R, TensorFlow, Scikit-Learn
Data Types         | Structured, semi-structured, and unstructured data  | Processed and cleaned data for analysis
Outcome            | Accessible data repositories for analysis           | Actionable insights, predictive models
Skill Set          | Data engineering, distributed computing             | Statistical analysis, machine learning, programming
10. Differences Between Big Data Analytics and Data Science
Aspect         | Big Data                                              | Data Science
Typical Roles  | Data Engineers, Big Data Analysts                     | Data Scientists, Machine Learning Engineers
Applications   | Real-time data processing, large-scale data storage   | Predictive analytics, data-driven decision making
Key Techniques | Distributed computing, data warehousing               | Statistical modeling, machine learning algorithms
11. Characteristic of Data Science
The data science life cycle encompasses five key steps that data must go through in order to provide valuable insights.

Data Collection
The first step in the data science life cycle is obtaining the data. Data scientists can collect data from a variety of sources, including databases, sensors, APIs (application programming interfaces), and online platforms.

Data Cleaning
After the data is collected, data scientists clean and preprocess it. This step, often referred to as data cleaning or wrangling, requires data scientists to format the data for analysis and deal with missing values, duplicates, and other errors.

Data Analysis
At this stage, the data is ready, so data scientists begin the so-called exploratory data analysis (EDA) process. The aim is to understand the data's underlying structures and main characteristics and identify patterns.

Data Modelling
Data scientists use machine learning algorithms or different statistical techniques in order to predict outcomes or explain relationships within the data. Depending on the problem, these models can be predictive, such as forecasting future sales, or descriptive, such as clustering customers by behavior.

Data Interpretation
At this stage, data scientists should have produced results by the end of the cycle, and all that is left to do is interpret the conclusions and communicate them to the rest of the team.
12. Introduction to Big Data Tools
What’s Hadoop?
Hadoop
Hadoop is an open-source software framework that is used for storing and processing large amounts of data in a distributed computing environment. It is designed to handle big data and is based on the MapReduce programming model, which allows for the parallel processing of large datasets.

Hadoop has two main components:
HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which allows for the storage of large amounts of data across multiple machines. It is designed to work with commodity hardware, which makes it cost-effective.
YARN (Yet Another Resource Negotiator): This is the resource management component of Hadoop, which manages the allocation of resources (such as CPU and memory) for processing the data stored in HDFS.
12. Introduction to Big Data Tools
What Does the Hadoop Ecosystem Look Like?
13. Big Data Ecosystem
What’s HDFS?
The Apache Hadoop HDFS is the distributed file
system of Hadoop that is designed to store large
files on cheap hardware. It is highly fault-tolerant
and provides high throughput to applications.
HDFS is best suited for applications that have very large data sets.
HDFS follows a master-slave architecture: the master node runs the NameNode daemon, while the slave (worker) nodes run DataNode daemons.
13. Big Data Ecosystem
What’s MapReduce?
MapReduce is the data processing layer of Hadoop. It splits a job into small tasks, assigns those tasks to many machines joined over a network, and assembles the intermediate results into the final output dataset. The basic unit of data required by MapReduce is the key-value pair: all data, whether structured or not, needs to be translated into key-value pairs before it is passed through the MapReduce model. In the MapReduce framework, the processing unit is moved to the data rather than moving the data to the processing unit.
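The classic illustration of this model is word count. The sketch below uses PySpark RDDs rather than the native Java MapReduce API, but the map and reduce phases over key-value pairs are the same idea.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["big data needs big tools", "data drives decisions"])

counts = (lines
          .flatMap(lambda line: line.split())   # split lines into words
          .map(lambda word: (word, 1))          # map phase: emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))     # reduce phase: sum counts per key

print(counts.collect())
```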
13. Big Data Ecosystem
What’s YARN?
YARN stands for “Yet Another Resource Negotiator” and is the resource management layer of the Hadoop cluster. YARN implements resource management and job scheduling in the Hadoop cluster. The primary idea of YARN is to split job scheduling and resource management into separate processes, making the overall operation more efficient.
YARN provides two daemons: the first is called the Resource Manager and the second is called the Node Manager. Both components are used to process data computation in YARN. The Resource Manager runs on the master node of the Hadoop cluster and negotiates resources among all applications, whereas the Node Manager is hosted on every slave (worker) node. The Node Manager is responsible for monitoring the containers and their resource usage (CPU, memory, disk, and network) and reporting these details to the Resource Manager.
13. Big Data Ecosystem
What’s Zookeeper?
Apache Zookeeper acts as a coordinator between
different services of Hadoop and is used for
maintaining configuration information, naming,
providing distributed synchronization, and
providing group services. ZooKeeper also helps applications that are newly deployed in a distributed environment avoid problems such as race conditions and deadlocks.
13. Big Data Ecosystem
What’s Hive?
Apache Hive is the data-warehousing project of Hadoop. Hive is intended to facilitate easy data summarization, ad-hoc querying, and analysis of large volumes of data. With the help of HiveQL, a user can run ad-hoc queries on datasets stored in HDFS and use the results for further analysis. Hive also supports custom user-defined functions that users can apply for custom analysis.
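A hedged sketch of an ad-hoc HiveQL-style query, run here through Spark's Hive support to stay consistent with the rest of the deck's tooling; a reachable Hive metastore and an existing table named sales are both assumptions.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-adhoc")
         .enableHiveSupport()   # requires a configured Hive metastore (assumption)
         .getOrCreate())

# Ad-hoc summarization over a hypothetical Hive table named `sales`.
spark.sql("""
    SELECT product, SUM(amount) AS revenue
    FROM sales
    GROUP BY product
    ORDER BY revenue DESC
""").show()
```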
13. Big Data Ecosystem
What’s HBase?
Apache HBase is a distributed, open-source, versioned, non-relational database modeled after Google's Bigtable. It is an important component of the Hadoop ecosystem that leverages the fault tolerance of HDFS and provides real-time read and write access to data. HBase is better described as a data storage system than a full database, because it does not provide RDBMS features such as triggers, a rich query language, or secondary indexes.
13. Big Data Ecosystem
What’s Spark?
Apache Spark is a general-purpose, fast cluster computing system and a very powerful tool for big data. Spark provides a rich set of APIs in multiple languages such as Python, Scala, Java, and R. Spark also ships with high-level libraries, including Spark SQL, GraphX, MLlib, Spark Streaming, and SparkR, each supporting a different kind of workload; the use cases later in this deck build on them.
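A minimal taste of the DataFrame and Spark SQL APIs mentioned above, with made-up rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-intro").getOrCreate()

people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 29), ("Cira", 41)], ["name", "age"])

# DataFrame API
people.filter(people.age > 30).show()

# The same query through Spark SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```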
14 Big Data Use Case Customer Profiling
Customer Profiling &
Recommendation System Simulation
In a rapidly evolving market, businesses must
understand their customers to tailor experiences
that increase satisfaction, engagement, and
revenue. This simulation delves into a data-driven
approach to customer profiling, churn prediction,
and personalized product recommendations using
Apache Spark.
14 Big Data Use Case Profiling
1. Customer Data Overview
This DataFrame contains basic customer attributes, such as:
● customer_id: A unique identifier for each customer.
● age: The customer’s age.
● gender: The customer’s gender (Male/Female).
● location: The city where the customer is based.
● amount_spent: How much the customer spent on their last purchase.
● purchase_date: The specific day of purchase.
● customer_name: Assigned name for easier interpretation.
14 Big Data Use Case Profiling
This dataset lays the foundation for customer behavior analysis.
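The deck does not show the loading code, but a sketch of how such a customer DataFrame could be built in PySpark looks like this (the ages, locations, and dates below are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("customer-profiling").getOrCreate()

customers = spark.createDataFrame(
    [(7, 31, "Female", "Jakarta", 285.0, "2024-06-12", "Hannah"),
     (9, 27, "Male", "Bandung", 123.0, "2024-06-15", "Jack")],
    ["customer_id", "age", "gender", "location",
     "amount_spent", "purchase_date", "customer_name"],
)
customers.show()
```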
14 Big Data Use Case Profiling
2. Product Catalog Overview
This DataFrame contains product details, including:
● product_name: The name of the product.
● product_id: A unique identifier for each product.
Example Products:
● Smartphone X (product_id 91900)
● Gaming Console (product_id 394900)
● VR Headset (product_id 544400)
14 Big Data Use Case Profiling
The product catalog enables linking purchases to specific items.
14 Big Data Use Case Profiling
3. Transaction Data Overview
This DataFrame shows which customers purchased which products, including:
● product_id: ID of the product bought
● customer_id: ID of the buyer.
● product_name: The actual product name.
● amount_spent: How much was paid for the product.
Example Transactions
● Hannah (customer_id 7) bought a Drone 4K (product_id 79300) for $285
● Jack (customer_id 9) purchased a Bluetooth Speaker (product_id 746800) for $123.
14 Big Data Use Case Profiling
This confirms successful product-to-customer mapping.
14 Big Data Use Case Profiling
4. Purchase Behavior Overview
This DataFrame tracks customer spending patterns, showing:
● purchase_count: The number of transactions a customer made.
● avg_spent: The average amount spent per purchase.
Example Insights:
● Rachel made 1 purchase, spending an average of $87.
● Bob spent an average of $92 per transaction.
14 Big Data Use Case Profiling
This helps identify high-value vs. occasional buyers.
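A sketch of how purchase_count and avg_spent could be derived from the transaction data; the exact code is not shown in the deck, and the sample rows are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("purchase-behavior").getOrCreate()

transactions = spark.createDataFrame(
    [(7, 79300, "Drone 4K", 285.0),
     (9, 746800, "Bluetooth Speaker", 123.0),
     (9, 91900, "Smartphone X", 61.0)],
    ["customer_id", "product_id", "product_name", "amount_spent"],
)

# Per-customer behavior: number of transactions and average spend.
purchase_behavior = (transactions
                     .groupBy("customer_id")
                     .agg(F.count("*").alias("purchase_count"),
                          F.avg("amount_spent").alias("avg_spent")))
purchase_behavior.show()
```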
14 Big Data Use Case Profiling
5. Customer Segmentation Overview
This DataFrame clusters customers based on purchase frequency and spending using K-Means
clustering:
● purchase_count: The number of purchases per customer.
● avg_spent: Average spending per transaction.
● features: Encoded vector [purchase_count, avg_spent] used for clustering.
● prediction: The assigned cluster (0, 1, or 2).
Example Insights:
● Hannah (customer_id 2707) is grouped in cluster 2, spending $285.
● Karen (customer_id 5650) belongs to cluster 0, having spent a high amount of $487.
14 Big Data Use Case Profiling
Businesses can now target different customer types separately.
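A hedged sketch of this segmentation step with Spark MLlib: the two behavioral columns are assembled into a feature vector and clustered with K-Means (k=3 to match the clusters 0-2 above; the rows are illustrative).

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("segmentation").getOrCreate()

behavior = spark.createDataFrame(
    [(2707, 1, 285.0), (5650, 3, 487.0), (8621, 1, 92.0), (3146, 2, 87.0)],
    ["customer_id", "purchase_count", "avg_spent"],
)

# Encode [purchase_count, avg_spent] as the `features` vector used for clustering.
features = VectorAssembler(inputCols=["purchase_count", "avg_spent"],
                           outputCol="features").transform(behavior)

model = KMeans(k=3, seed=42, featuresCol="features").fit(features)
segments = model.transform(features)   # adds the `prediction` (cluster) column
segments.select("customer_id", "purchase_count", "avg_spent", "prediction").show()
```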
14 Big Data Use Case Profiling
6. Churn Prediction Overview
This DataFrame predicts which customers are at risk of leaving:
● churn: 1 means likely to churn, 0 means active
● prediction: The AI’s decision (1 = churn risk, 0 = active)
Example Churn Risks:
● Bob (customer_id 8621) has a churn prediction of 1.0, meaning he’s likely to stop purchasing.
● Grace (customer_id 3146) is also flagged as a churn risk.
14 Big Data Use Case Profiling
Business Impact: Companies can offer promotions or loyalty rewards to retain at-risk customers.
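The deck does not name the churn model, so the sketch below uses logistic regression on the same behavioral features as one plausible choice; the rows and labels are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn").getOrCreate()

# Hypothetical labeled data: churn = 1 means the customer stopped purchasing.
labeled = spark.createDataFrame(
    [(8621, 1, 92.0, 1.0), (3146, 1, 87.0, 1.0),
     (5650, 3, 487.0, 0.0), (2707, 2, 285.0, 0.0)],
    ["customer_id", "purchase_count", "avg_spent", "churn"],
)

features = VectorAssembler(inputCols=["purchase_count", "avg_spent"],
                           outputCol="features").transform(labeled)

model = LogisticRegression(featuresCol="features", labelCol="churn").fit(features)
scored = model.transform(features)     # adds `prediction` and `probability` columns
scored.select("customer_id", "churn", "prediction").show()
```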
14 Big Data Use Case Profiling
7. Recommendation Results Overview
This DataFrame provides AI-driven product suggestions based on customer purchasing patterns:
● customer_name: The buyer’s name.
● product_name: The recommended product.
● product_id: The suggested item’s ID.
● rating: The confidence score (higher = stronger recommendation).
Example Recommendations:
● Grace is recommended to buy "Smartphone X" with a high rating of 349.97.
● Nancy is strongly encouraged to purchase "Smartphone X" with a confidence score of 73.95.
14 Big Data Use Case Profiling
These recommendations help businesses suggest relevant products to customers, improving sales and
engagement.
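The deck does not show the recommender's code; a common Spark choice is ALS collaborative filtering, sketched below with amount_spent used as an implicit preference signal (that choice, and the sample rows, are assumptions).

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommendations").getOrCreate()

# Hypothetical customer-product interactions.
interactions = spark.createDataFrame(
    [(7, 79300, 285.0), (9, 746800, 123.0), (3, 91900, 350.0), (14, 91900, 74.0)],
    ["customer_id", "product_id", "amount_spent"],
)

als = ALS(userCol="customer_id", itemCol="product_id", ratingCol="amount_spent",
          implicitPrefs=True, coldStartStrategy="drop", seed=42)
model = als.fit(interactions)

# For each customer, an array of (product_id, rating) suggestions; `rating` is
# the model's confidence score, like the scores shown above.
model.recommendForAllUsers(2).show(truncate=False)
```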
14 Big Data Use Case Profiling
Final Summary
This simulation successfully:
● Mapped customer purchases to actual products
● Segmented customers into spending groups
● Predicted churn risks for customer retention
● Generated personalized product recommendations
15 Big Data Use Case Sentiment Analytics
Understanding Sentiment Analytics in Spark: A Narrative Approach
Imagine a world where businesses can instantly understand customer emotions, predicting how people
feel about a product, a service, or an experience. In today’s digital landscape, millions of customer
reviews, tweets, and social media comments are generated daily—each containing valuable insights into
public sentiment.
But how do businesses make sense of this massive textual data? Enter Sentiment Analysis in Apache
Spark, a powerful technique that harnesses Natural Language Processing (NLP) to classify emotions at
scale.
15 Big Data Use Case Sentiment Analytics
Running Sentiment Analytics in Spark
This simulation takes 100 text samples—a mix of product reviews, customer feedback, and user
opinions—and assigns sentiment labels based on emotional tone.
We leverage TextBlob, a simple yet effective NLP library, to classify text into three emotions:
Positive: Expresses joy, approval, or satisfaction.
Negative: Shows disappointment, frustration, or criticism.
Neutral: Neither strongly positive nor negative.
Each sentence undergoes sentiment scoring, calculated using polarity values from -1 to +1.
Spark processes thousands of reviews in parallel, enabling fast and scalable analysis.
15 Big Data Use Case Sentiment Analytics
Behind the Scenes: How Spark Classifies Sentiment
1. Loading text data: We ingest customer comments into a Spark DataFrame.
2. Applying NLP: Spark uses TextBlob to compute sentiment polarity for each sentence.
3. Classification: If the polarity score is:
○ Greater than 0, it's labeled Positive.
○ Less than 0, it's Negative.
○ Equal to 0, it's Neutral.
4. Final Output: A Spark DataFrame is generated, mapping each review to its sentiment label.
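A minimal sketch of those four steps: a TextBlob polarity UDF applied to a Spark DataFrame of reviews, with the polarity thresholded into the three labels. The sample sentences are illustrative, and TextBlob must be installed on the Spark workers.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
from textblob import TextBlob

spark = SparkSession.builder.appName("sentiment").getOrCreate()

# 1. Load text data into a Spark DataFrame.
reviews = spark.createDataFrame(
    [("I love this product",),
     ("Terrible customer service",),
     ("It arrived on Tuesday",)],
    ["text"],
)

# 2. Apply NLP: compute TextBlob's polarity score (-1 to +1) for each sentence.
@F.udf(returnType=DoubleType())
def polarity(text):
    return float(TextBlob(text).sentiment.polarity)

# 3. Classify: >0 Positive, <0 Negative, =0 Neutral.  4. Final labeled DataFrame.
labeled = (reviews
           .withColumn("polarity", polarity("text"))
           .withColumn("sentiment",
                       F.when(F.col("polarity") > 0, "Positive")
                        .when(F.col("polarity") < 0, "Negative")
                        .otherwise("Neutral")))
labeled.show(truncate=False)
```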
15 Big Data Use Case Sentiment Analytics
Business Impact: Why Sentiment Analysis Matters
With this pipeline in place, businesses can:
✅ Track customer happiness in real time
✅ Predict brand reputation trends
✅ Identify product strengths & weaknesses
✅ Respond quickly to negative feedback before it escalates
Imagine an e-commerce brand launching a new product. Spark instantly scans thousands of
social media posts, detecting negative reviews before they go viral. This allows the company to
mitigate damage by addressing customer concerns proactively.
REFERENCES (SANDBOX PLAYGROUND)
1. Installation:
a. Install VirtualBox
b. Download the Cloudera QuickStart VM:
https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.13.0-0-virtualbox.zip
c. Attach the Cloudera image to VirtualBox
d. Start the image in VirtualBox
2. Configuration:
a. After starting the Cloudera VM, the Cloudera QuickStart page will appear in the browser.
b. Check the hostname: hostname
c. hdfs dfs -ls /
d. service cloudera-scm-server status
e. su - (password: cloudera)
f. sudo /home/cloudera/cloudera-manager --force --express
g. Wait a few minutes, then done.
REFERENCES
● https://seas.harvard.edu/news/what-data-science-definition-skills-applications-more
● https://www.ibm.com/think/topics/big-data-analytics
● https://www.geeksforgeeks.org/data-science/
● https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
● https://www.dasca.org/world-of-data-science/article/big-data-processing-transforming-data-into-actionable-insights
● https://www.upgrad.com/blog/major-challenges-of-big-data/
● https://www.datamation.com/big-data/big-data-challenges/
● https://www.oracle.com/id/big-data/what-is-big-data/
Thank You!
Have questions or want to stay connected?
Feel free to reach out:
[email protected]
linkedin.com/in/alfandy-sulaiman-barkah/
I'm open to discussions, collaborations, or
any feedback you may have.