ARTIFICIAL INTELLIGENCE
CLASS XII
STUDENT HANDBOOK
2025-26
Subject Code: 843
UNIT 5: Introduction to Big Data and Data Analytics
Title: Introduction to Big Data and Data Analytics
Approach: Team discussion, Web search
Summary: Students will delve into the world of Big Data, a game-changer in today's
digital age. They will gain insights into the various types of data and their unique
characteristics, equipping them to understand how this vast information is managed
and analysed. The journey continues as students discover the real-world applications
of Big Data and Data Analytics in diverse fields, witnessing how this revolutionary
concept is transforming how we approach data analysis to unlock new possibilities.
Learning Objectives:
1. Students will develop an understanding of the concept of Big Data and its
development in the new digital era.
2. Students will appreciate the role of big data in AI and Data Science.
3. Students will learn to understand the features of Big Data and how these
features are handled in Big Data Analytics.
4. Students will appreciate its applications in various fields and how this new
concept has evolved to bring new dimensions to Data Analysis.
5. Students will understand the term mining data streams.
Key Concepts:
1. Introduction to Big Data
2. Types of Big Data
3. Advantages and Disadvantages of Big Data
4. Characteristics of Big Data
5. Big Data Analytics
6. Working on Big Data Analytics
7. Mining Data Streams
8. Future of Big Data Analytics
Learning Outcomes:
Students will be able to –
1. Define Big Data and identify its various types.
2. Evaluate the advantages and disadvantages of Big Data.
3. Recognize the characteristics of Big Data.
4. Explain the concept of Big Data Analytics and its significance.
5. Describe how Big Data Analytics works.
6. Explore future trends and advancements in Big Data Analytics.
5.1. What is Big Data?
To understand Big Data, let us first consider small data: datasets small enough to be
stored and processed on a single machine using conventional tools such as spreadsheets
and relational databases.
Big Data refers to extremely large and complex datasets that regular computer programs
and databases cannot handle. It comes from three main sources: transactional data (e.g.,
online purchases), machine data (e.g., sensor readings), and social data (e.g., social media
posts). To analyze and use Big Data effectively, special tools and techniques are required.
These tools help organizations find valuable insights hidden in the data, which lead to
innovations and better decision-making. For example, companies like Amazon and Netflix
use Big Data to recommend products or shows based on users’ past activities.
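The idea behind such recommendations can be illustrated with a toy co-occurrence approach: recommend the shows most often watched by the same users who watched a given show. This is a deliberately simplified sketch, and the users and show names below are invented; real recommendation systems use far more sophisticated models over far larger data.

```python
from collections import Counter
from itertools import combinations

# Toy watch histories (hypothetical users and shows).
histories = {
    "user1": {"ShowA", "ShowB", "ShowC"},
    "user2": {"ShowA", "ShowB"},
    "user3": {"ShowB", "ShowC"},
}

# Count how often each pair of shows is watched by the same user.
pair_counts = Counter()
for shows in histories.values():
    for a, b in combinations(sorted(shows), 2):
        pair_counts[(a, b)] += 1

def recommend(show, k=1):
    """Recommend the k shows most often co-watched with `show`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == show:
            scores[b] += n
        elif b == show:
            scores[a] += n
    return [s for s, _ in scores.most_common(k)]

print(recommend("ShowA"))  # the show most often co-watched with ShowA
```

At real scale, the same counting idea is distributed across many machines, which is exactly the kind of workload Big Data tools are built for.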
Fig. 5.2
Aspect | Structured Data | Semi-Structured Data | Unstructured Data
Definition | Quantitative data with a defined structure | A mix of quantitative and qualitative properties | No inherent structure or formal rules
Data Model | Dedicated data model | May lack a specific data model | Lacks a consistent data model
Organization | Organized in clearly defined columns | Less organized than structured data | No organization; exhibits variability over time
Accessibility | Easily accessible and searchable | Accessible but may be harder to analyze | Accessibility depends on the specific data format
Examples | Customer information, transaction records, product directories | XML files, CSV files, JSON files, HTML files, PDFs, semi-structured documents | Audio files, images, video files, emails, social media posts
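The distinction between structured and semi-structured data can be made concrete in code. The snippet below (a minimal sketch using only the Python standard library; the product records are invented) reads the same kind of information from a structured CSV, where every row follows a fixed schema, and from a semi-structured JSON document, where records may carry different fields:

```python
import csv
import io
import json

# Structured: CSV with a fixed schema -- every row has the same columns.
csv_text = "name,price\nLaptop,55000\nPhone,20000\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["price"])  # every record is guaranteed to have this field

# Semi-structured: JSON records need not match -- the second record has
# an extra nested field that the first one lacks.
json_text = '''[
  {"name": "Laptop", "price": 55000},
  {"name": "Phone", "price": 20000, "specs": {"ram_gb": 8}}
]'''
products = json.loads(json_text)
for p in products:
    # Fields must be looked up defensively; the schema is not fixed.
    ram = p.get("specs", {}).get("ram_gb", "unknown")
    print(p["name"], ram)
```

Unstructured data (audio, images, free text) has no such record layout at all and needs specialised techniques, such as text mining or computer vision, before it can be analysed.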
Big Data is a key driver of modern innovation. It has changed how organizations analyze and use
information. While it offers great benefits, it also comes with challenges that affect its use
in different industries. In this section, we discuss a few pros and cons of Big Data.
Advantages:
● Enhanced Decision Making: Big Data analytics empowers organizations to make
data-driven decisions based on insights derived from large and diverse datasets.
● Improved Efficiency and Productivity: By analyzing vast amounts of data,
businesses can identify inefficiencies, streamline processes, and optimize resource
allocation, leading to increased efficiency and productivity.
● Better Customer Insights: Big Data enables organizations to gain a deeper
understanding of customer behavior, preferences, and needs, allowing for
personalized marketing strategies and improved customer experiences.
● Competitive Advantage: Leveraging Big Data analytics provides organizations with
a competitive edge by enabling them to uncover market trends, identify
opportunities, and stay ahead of competitors.
● Innovation and Growth: Big Data fosters innovation by facilitating the development
of new products, services, and business models based on insights derived from data
analysis, driving business growth and expansion.
Disadvantages:
● Privacy and Security Concerns: The collection, storage, and analysis of large
volumes of data raise significant privacy and security risks, including unauthorized
access, data breaches, and misuse of personal information.
● Data Quality Issues: Ensuring the accuracy, reliability, and completeness of data can
be challenging, as Big Data often consists of unstructured and heterogeneous data
sources, leading to potential errors and biases in analysis.
● Technical Complexity: Implementing and managing Big Data infrastructure and
analytics tools require specialized skills and expertise, leading to technical
challenges and resource constraints for organizations.
● Regulatory Compliance: Organizations face challenges in meeting data protection
laws like GDPR (General Data Protection Regulation) and The Digital Personal Data
Protection Act, 2023. These laws require strict handling of personal data, making
compliance essential to avoid legal risks and penalties.
● Cost and Resource Intensiveness: The cost of acquiring, storing, processing, and
analyzing Big Data, along with hiring skilled staff, can be high. This is especially
challenging for smaller organizations with limited budgets and resources.
Activity: Find the sources of big data using the link UNSTATS
5.4.1. Velocity: Velocity refers to the speed at which data is generated, delivered, and
analyzed. In the present world, where millions of people are accessing and storing
information online, data is generated and stored at an enormous rate. For example,
Google alone handles more than 40,000 search queries per second. See the statistics in
the figure provided. Isn't that huge!
Fig. 5.4 Speed of data generation from various sources
5.4.4. Veracity: Veracity is the characteristic of Big Data related to consistency, accuracy,
quality, and trustworthiness. Not all data that undergoes processing holds value.
Therefore, it is essential to clean data effectively before storing or processing it,
especially when dealing with massive volumes. Veracity addresses this aspect of
big data, focusing on the accuracy and reliability of the data source and its suitability
for analytical models.
Fig. 5.7
Fig. 5.9
Case Study: How a Company Uses 3V and 6V Frameworks for Big Data
Company: An OTT Platform ‘OnDemandDrama’
3V Framework:
Volume: OnDemandDrama processes huge amounts of data from millions of users, including watch
history, ratings, searches, and preferences to offer personalized content recommendations.
Velocity: Data is processed in real-time, allowing OnDemandDrama to immediately adjust
recommendations, track the patterns of the users, and offer trending content based on their
activity.
Variety: The platform handles diverse data such as user profiles, watch lists, video content, and
user reviews which are categorized as structured, semi-structured, and unstructured data.
6V Framework:
Along with the above three Vs of Big Data, the 6V Framework adds three more characteristics:
Veracity, Value, and Variability.
Veracity: OnDemandDrama filters out irrelevant or low-quality data (such as incomplete profiles) to
ensure accurate content recommendations.
Value: OnDemandDrama uses the data to personalize user experiences, driving engagement and
retention by recommending shows and movies that match individual tastes.
Variability: OnDemandDrama handles changes or inconsistencies in data streams caused by factors
like user behavior, trends, or any other external events. For example, user preferences can vary
based on region, time, or trends.
By using the 3V and 6V frameworks, OnDemandDrama can manage, process, and derive valuable
insights from its Big Data, which enhances customer satisfaction and drives business decisions.
Big data analytics uses advanced analytic techniques against huge, diverse datasets that
include structured, semi-structured, and unstructured data, from different sources, and in
various sizes from terabytes to zettabytes.
Big Data Analytics emerges as a consequence of four significant global trends:
1. Moore’s Law: The exponential growth of computing power as per Moore's Law has
enabled the handling and analysis of massive datasets, driving the evolution of Big
Data Analytics.
2. Mobile Computing: With the widespread adoption of smartphones and mobile
devices, access to vast amounts of data is now at our fingertips, enabling real-time
connectivity and data collection from anywhere.
3. Social Networking: Platforms such as Facebook, Foursquare, and Pinterest facilitate
extensive networks of user-generated content, interactions, and data sharing,
leading to the generation of massive datasets ripe for analysis.
4. Cloud Computing: This paradigm shift in technology infrastructure allows
organizations to access hardware and software resources remotely via the Internet
on a pay-as-you-go basis, eliminating the need for extensive on-premises hardware
and software investments.
Example: Data Analytics Tools – Tableau, Apache Hadoop, Cassandra, MongoDB, SAS
Using Orange Data Mining for Big Data Analytics
We will explore how big data analysis can be performed using Orange Data Mining.
It is important to carefully study the dataset and understand the features and target
variable.
● Features: age, gender, chest pain, resting blood pressure (rest_sbp), cholesterol,
resting ECG (rest_ecg), maximum heart rate (max_hr), etc.
● Target: diameter narrowing.
Step 2: Process Data
Data processing involves preparing the data for accurate analysis. There are two methods:
After applying normalization, you will see that all numerical values are scaled between 0 and 1.
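Scaling every numerical value into the range 0 to 1 is called min-max normalization. The following is a minimal sketch of the calculation (the cholesterol values below are made up for illustration, not taken from the actual dataset):

```python
# Min-max normalization: x' = (x - min) / (max - min), mapping values to [0, 1].
cholesterol = [126, 200, 564, 303]

lo, hi = min(cholesterol), max(cholesterol)
scaled = [(x - lo) / (hi - lo) for x in cholesterol]

print(scaled)  # the minimum becomes 0, the maximum becomes 1
```

This keeps features with large numeric ranges (like cholesterol) from dominating features with small ranges (like age) during analysis.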
Step 3: Clean Data
Data cleaning is essential to ensure quality results. We will use the Impute widget to
handle missing values by replacing them with the mean, median, mode, or a custom
value. As the figure below shows, some values in this dataset are missing. This dataset
with missing values is saved as heart data.xlsx in the computer folder.
Step 3.3: Verify Cleaned Data
1. Connect the Data Table widget to the Impute widget.
2. Open the Data Table to confirm the missing values have been replaced.
Missing values are now filled with the chosen method (e.g., average values).
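What the Impute widget does with the mean strategy can be sketched in plain Python. The heart-rate values below are invented toy data, not the actual dataset; missing entries are represented as None:

```python
# Mean imputation: replace missing entries (None) with the mean of the
# values that are actually present in the column.
max_hr = [150, None, 172, None, 168]

present = [v for v in max_hr if v is not None]
mean = sum(present) / len(present)
imputed = [mean if v is None else v for v in max_hr]

print(imputed)  # both missing entries are filled with the column mean
```

Median or mode imputation works the same way, just with a different summary statistic substituted for the missing entries.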
Step 4.3: Choose a Validation Method
1. Double-click the Test and Score widget. Select a validation method (e.g., Cross-
Validation).
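Cross-validation splits the data into k folds, trains on k-1 of them, and tests on the remaining one, rotating so that every fold is used for testing exactly once. The sketch below shows just the fold-splitting logic, independent of Orange and of any actual model:

```python
def kfold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    # Distribute n samples across k folds as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        yield train, test
        start += size

# 10 samples, 5 folds: each fold tests on 2 samples and trains on the other 8.
folds = list(kfold_indices(10, 5))
print(len(folds))
```

Because every sample is tested exactly once, the averaged score is a more reliable estimate of model performance than a single train/test split.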
To understand mining data streams, let us first understand what a data stream is. A data
stream is a continuous, real-time flow of data generated by various sources, such as
sensors, satellite image data, and Internet and web traffic.
Mining data streams refers to the process of extracting meaningful patterns, trends,
and knowledge from a continuous flow of real-time data. Unlike traditional data mining, it
processes data as it arrives, without storing it completely. An example of an area where data
stream mining can be applied is website data. Websites typically receive continuous
streams of data daily. For instance, a sudden spike in searches for "election results" on a
particular day might indicate that elections were recently held in a region or highlight the
level of public interest in the results.
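A core technique in stream mining is keeping statistics over a sliding window of recent events, so that memory stays bounded no matter how long the stream runs. The toy sketch below counts only the most recent search queries (the queries and the tiny window size are invented for illustration):

```python
from collections import Counter, deque

WINDOW = 3  # keep counts for only the last 3 events

window = deque()
counts = Counter()

def observe(query):
    """Process one event from the stream, evicting events that fall out of the window."""
    window.append(query)
    counts[query] += 1
    if len(window) > WINDOW:
        old = window.popleft()
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]

stream = ["weather", "election results", "election results",
          "election results", "cricket score"]
for q in stream:
    observe(q)

# Only the last 3 events are counted, so the recent spike in
# "election results" dominates while the old "weather" query is gone.
print(counts.most_common(1))
```

Unlike traditional mining over a stored dataset, each event here is processed once and then discarded, which is exactly the constraint data-stream mining works under.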
The future of Big Data Analytics is highly influenced by several key technological
advancements that will shape the way data is processed and analyzed. A few of them are:
Education
Environmental Science
Media and Entertainment
Activity-2
List the steps involved in the working process of Big Data analytics.
Step 1:
Step 2:
Step 3:
Step 4:
EXERCISES
5. Which technique is commonly used for analyzing large datasets to discover patterns
and relationships?
a) Linear regression b) Data mining c) Decision trees d) Naive Bayes
6. Which term describes the process of extracting useful information from large
datasets?
a) Data analytics b) Data warehousing c) Data integration d) Data virtualization
8. What role does Hadoop play in big data processing?
a) Hadoop is a programming language used for big data analytics.
b) Hadoop is a distributed file system for storing and processing big data.
c) Hadoop is a data visualization tool.
d) Hadoop is a NoSQL database management system.
9. What is the primary challenge associated with the veracity aspect of big data?
a) Handling large volumes of data
b) Ensuring data quality and reliability
c) Dealing with diverse data types
d) Managing data processing speed
B. True or False
1. Big data refers to datasets that are too large to be processed by traditional
database systems.
2. Structured data is the primary type of data processed in big data analytics, making
up the majority of datasets.
3. Veracity refers to the trustworthiness and reliability of data in big data analytics.
4. Real-time analytics involves processing and analyzing data as it is generated, without
any delay.
5. Cloud computing is the only concept used in Big Data Analytics.
6. A CSV file is an example of structured data.
7. “Positive, Negative, and Neutral” are terms related to Sentiment Analysis.
8. Data preprocessing is a critical step in big data analytics, involving cleaning,
transforming, and aggregating data to prepare it for analysis.
9. To analyze vast collections of textual materials to capture key concepts, trends, and
hidden relationships, the concept of Text mining is used.
D. Long answer questions
1. Explain the 6 V’s related to Big data.
2. Explain the differences between structured, semi-structured, and unstructured data.
3. Explain the process of Big Data Analytics.
4. Why is Big Data Analytics important in modern industries and decision-making
processes?
5. A healthcare company is using Big Data analytics to manage patient records, predict
disease outbreaks, and personalize treatments. However, the company is facing
challenges regarding data privacy, as patient information is highly sensitive. What
are the potential risks to patient privacy when using Big Data in healthcare, and how
can these be mitigated?
6. Given the following list of data types, categorize each as Structured, Unstructured,
or Semi-Structured:
a) A customer database with fields such as Name, Address, Phone Number, and
Email.
b) A JSON file containing product information with attributes like name, price, and
specifications.
c) Audio recordings of customer service calls.
d) A sales report in Excel format with rows and columns.
e) A collection of social media posts, including text, images, and hashtags.
f) A CSV file with daily temperature readings for the past year.