
ARTIFICIAL INTELLIGENCE
CLASS XII

STUDENT HANDBOOK
2025-26

Subject Code: 843
UNIT 5: Introduction to Big Data and Data Analytics
Title: Introduction to Big Data and Data Analytics
Approach: Team discussion, Web search

Summary: Students will delve into the world of Big Data, a game-changer in today's
digital age. They will gain insights into the various types of data and their unique
characteristics, equipping them to understand how this vast information is managed
and analysed. The journey continues as students discover the real-world applications
of Big Data and Data Analytics in diverse fields, witnessing how this revolutionary
concept is transforming how we approach data analysis to unlock new possibilities.

Learning Objectives:
1. Students will develop an understanding of the concept of Big Data and its
development in the new digital era.
2. Students will appreciate the role of Big Data in AI and Data Science.
3. Students will understand the features of Big Data and how these features are
handled in Big Data Analytics.
4. Students will appreciate the applications of Big Data in various fields and how this
new concept has evolved to bring new dimensions to Data Analysis.
5. Students will understand the term mining data streams.

Key Concepts:
1. Introduction to Big Data
2. Types of Big Data
3. Advantages and Disadvantages of Big Data
4. Characteristics of Big Data
5. Big Data Analytics
6. Working on Big Data Analytics
7. Mining Data Streams
8. Future of Big Data Analytics

Learning Outcomes:
Students will be able to –
1. Define Big Data and identify its various types.
2. Evaluate the advantages and disadvantages of Big Data.
3. Recognize the characteristics of Big Data.
4. Explain the concept of Big Data Analytics and its significance.
5. Describe how Big Data Analytics works.
6. Explore future trends and advancements in Big Data Analytics.

Prerequisites: Understanding the concept of data and reasonable fluency in the
English language.

5.1. What is Big Data?
To understand Big Data, let us first understand small data.

Small data refers to datasets that are easily comprehensible by people, as they are
easily accessible, informative, and actionable. This makes it ideal for individuals and
businesses to find useful information and make better choices in everyday tasks. For
example, a small store might track daily sales to decide what products to restock.
Fig. 5.1 Sources of Big Data

Big Data refers to extremely large and complex datasets that regular computer programs
and databases cannot handle. It comes from three main sources: transactional data (e.g.,
online purchases), machine data (e.g., sensor readings), and social data (e.g., social media
posts). To analyze and use Big Data effectively, special tools and techniques are required.
These tools help organizations find valuable insights hidden in the data, which lead to
innovations and better decision-making. For example, companies like Amazon and Netflix
use Big Data to recommend products or shows based on users’ past activities.

5.2. Types of Big Data

Fig. 5.2

| Aspect | Structured Data | Semi-Structured Data | Unstructured Data |
| --- | --- | --- | --- |
| Definition | Quantitative data with a defined structure | A mix of quantitative and qualitative properties | No inherent structure or formal rules |
| Data Model | Dedicated data model | May lack a specific data model | Lacks a consistent data model |
| Organization | Organized in clearly defined columns | Less organized than structured data | No organization; exhibits variability over time |
| Accessibility | Easily accessible and searchable | Accessibility depends on the specific data format | Accessible but may be harder to analyze |
| Examples | Customer information, transaction records, product directories | XML files, CSV files, JSON files, HTML files, PDFs, semi-structured documents | Audio files, images, video files, emails, social media posts |
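To make the distinction concrete, here is a minimal Python sketch that loads one example of each type. The file names (customers.csv, product.json, review.txt) and field names are hypothetical, chosen only to illustrate the idea.

```python
# Minimal sketch: reading the three types of data in Python.
# All file names and fields below are hypothetical examples.
import csv
import json

# Structured: fixed rows and columns with a defined schema.
with open("customers.csv", newline="") as f:
    for row in csv.DictReader(f):             # each row maps column -> value
        print(row["Name"], row["Email"])

# Semi-structured: tagged fields, but no rigid tabular schema.
with open("product.json") as f:
    product = json.load(f)                    # nested keys may vary per record
    print(product.get("name"), product.get("price"))

# Unstructured: raw content with no inherent fields at all.
with open("review.txt") as f:
    text = f.read()                           # needs text mining / NLP to analyze
    print(len(text.split()), "words")
```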

5.3. Advantages and Disadvantages of Big Data:

Big Data is a key to modern innovation. It has changed how organizations analyze and use
information. While it offers great benefits, it also comes with challenges that affect its use
in different industries. In this section, we will be discussing a few pros and cons of big data.

Advantages:
● Enhanced Decision Making: Big Data analytics empowers organizations to make
data-driven decisions based on insights derived from large and diverse datasets.
● Improved Efficiency and Productivity: By analyzing vast amounts of data,
businesses can identify inefficiencies, streamline processes, and optimize resource
allocation, leading to increased efficiency and productivity.
● Better Customer Insights: Big Data enables organizations to gain a deeper
understanding of customer behavior, preferences, and needs, allowing for
personalized marketing strategies and improved customer experiences.
● Competitive Advantage: Leveraging Big Data analytics provides organizations with
a competitive edge by enabling them to uncover market trends, identify
opportunities, and stay ahead of competitors.
● Innovation and Growth: Big Data fosters innovation by facilitating the development
of new products, services, and business models based on insights derived from data
analysis, driving business growth and expansion.

Disadvantages:
● Privacy and Security Concerns: The collection, storage, and analysis of large
volumes of data raise significant privacy and security risks, including unauthorized
access, data breaches, and misuse of personal information.
● Data Quality Issues: Ensuring the accuracy, reliability, and completeness of data can
be challenging, as Big Data often consists of unstructured and heterogeneous data
sources, leading to potential errors and biases in analysis.
● Technical Complexity: Implementing and managing Big Data infrastructure and
analytics tools require specialized skills and expertise, leading to technical
challenges and resource constraints for organizations.
● Regulatory Compliance: Organizations face challenges in meeting data protection
laws like GDPR (General Data Protection Regulation) and The Digital Personal Data
Protection Act, 2023. These laws require strict handling of personal data, making
compliance essential to avoid legal risks and penalties.
● Cost and Resource Intensiveness: The cost of acquiring, storing, processing, and
analyzing Big Data, along with hiring skilled staff, can be high. This is especially
challenging for smaller organizations with limited budgets and resources.

Activity: Find the sources of big data using the link UNSTATS

5.4. Characteristics of Big Data


The “characteristics of Big Data” refer to the defining attributes that distinguish large
and complex datasets from traditional data sources. These characteristics are commonly
described using the "3Vs" framework: Volume, Velocity, and Variety. The extended 6Vs
framework provides a holistic view of Big Data, emphasizing not only its volume,
velocity, and variety but also its veracity, variability, and value. Understanding and
addressing these six dimensions is essential for effectively managing, analyzing, and
deriving value from Big Data in various domains.
Fig. 5.3 Characteristics of Big Data

5.4.1. Velocity: Velocity refers to the speed at which data is generated, delivered, and
analyzed. In the present world, where millions of people are accessing and storing
information online, the speed at which data gets generated and stored is huge. For
example, Google alone handles more than 40,000 search queries per second. See the
statistics in the figure provided. Isn't that huge!

Fig. 5.4 Speed of data generation from various sources

5.4.2. Volume: Every day a huge volume of data is generated, as the number of people
using online platforms has increased exponentially. Such a huge volume of data is
considered Big Data. Typically, if the data volume exceeds gigabytes, it falls into the
realm of big data. This volume can range from terabytes to petabytes or even exabytes,
based on surveys conducted by various organizations. According to the latest estimates,
328.77 million terabytes of data are created each day.

Fig. 5.5 Volume of data

5.4.3. Variety: Big data encompasses data in various formats, including structured,
unstructured, semi-structured, or highly complex structured data. These can range from
simple numerical data to complex and diverse forms such as text, images, audio, videos,
and so on. Storing and processing unstructured data through an RDBMS is challenging.
However, unstructured data often provides valuable insights that structured data cannot
offer. Additionally, the variety of data sources within big data provides information on
the diversity of data.

Fig. 5.6 Varieties in Big data

5.4.4. Veracity: Veracity is a characteristic of Big Data related to consistency, accuracy,
quality, and trustworthiness. Not all data that undergoes processing holds value.
Therefore, it is essential to clean data effectively before storing or processing it,
especially when dealing with massive volumes. Veracity addresses this aspect of big
data, focusing on the accuracy and reliability of the data source and its suitability for
analytical models.

Fig. 5.7

5.4.5. Value: The goal of big data analysis lies in extracting business value from the
data. Hence, the business value derived from big data is perhaps its most critical
characteristic. Without valuable insights, the other characteristics of big data hold little
significance. In simple terms, the value of Big Data refers to the benefits it can provide.

Fig. 5.8 The value of Big Data

5.4.6. Variability: Variability refers to establishing whether the contextualizing structure
of the data stream is regular and dependable, even in conditions of extreme
unpredictability. It reflects the need to obtain meaningful data under all possible
circumstances.

Fig. 5.9

Case Study: How a Company Uses 3V and 6V Frameworks for Big Data
Company: An OTT Platform ‘OnDemandDrama’
3V Framework:
Volume: OnDemandDrama processes huge amounts of data from millions of users, including watch
history, ratings, searches, and preferences to offer personalized content recommendations.
Velocity: Data is processed in real-time, allowing OnDemandDrama to immediately adjust
recommendations, track the patterns of the users, and offer trending content based on their
activity.
Variety: The platform handles diverse data such as user profiles, watch lists, video content, and
user reviews which are categorized as structured, semi-structured, and unstructured data.
6V Framework:
Along with the above three Vs of big data, the 6V framework involves three more
characteristics: Veracity, Value, and Variability.

Veracity: OnDemandDrama filters out irrelevant or low-quality data (such as incomplete profiles) to
ensure accurate content recommendations.
Value: OnDemandDrama uses the data to personalize user experiences, driving engagement and
retention by recommending shows and movies that match individual tastes.
Variability: OnDemandDrama handles changes or inconsistencies in data streams caused by factors
like user behavior, trends, or any other external events. For example, user preferences can vary
based on region, time, or trends.
By using the 3V and 6V frameworks, OnDemandDrama can manage, process, and derive valuable
insights from its Big Data, which enhances customer satisfaction and drives business decisions.

5.5. Big Data Analytics


Data Analytics

Data analytics involves analyzing datasets to uncover insights, trends, and patterns. It
can be applied to datasets of any size, from small to moderate volumes. Technologies
commonly used in data analytics include statistical analysis software, data visualization
tools, and relational database management systems (RDBMS).

Big data analytics uses advanced analytic techniques against huge, diverse datasets that
include structured, semi-structured, and unstructured data, from different sources, and in
various sizes from terabytes to zettabytes.

Big Data Analytics encompasses the methodologies, tools, and practices involved in
analyzing and managing data, covering tasks such as data collection, organization, and
storage. The primary objective of data analytics is to utilize statistical analysis and
technological methods to uncover patterns and address challenges. In the business
realm, big data analytics has gained significance as a means to assess and refine
business processes, as well as enhance decision-making and overall business
performance. It provides valuable insights and forecasts that help businesses make
informed decisions to improve their operations and outcomes. Different types of Big
Data Analytics can help businesses and organizations find insights from large and
complex datasets. Some of the common types are descriptive analytics, diagnostic
analytics, predictive analytics, and prescriptive analytics, which we have discussed in
Unit 2 of Data Science Methodology.

Big Data Analytics emerges as a consequence of four significant global trends:
1. Moore’s Law: The exponential growth of computing power as per Moore's Law has
enabled the handling and analysis of massive datasets, driving the evolution of Big
Data Analytics.
2. Mobile Computing: With the widespread adoption of smartphones and mobile
devices, access to vast amounts of data is now at our fingertips, enabling real-time
connectivity and data collection from anywhere.
3. Social Networking: Platforms such as Facebook, Foursquare, and Pinterest facilitate
extensive networks of user-generated content, interactions, and data sharing,
leading to the generation of massive datasets ripe for analysis.
4. Cloud Computing: This paradigm shift in technology infrastructure allows
organizations to access hardware and software resources remotely via the Internet
on a pay-as-you-go basis, eliminating the need for extensive on-premises hardware
and software investments.

5.6. Working on Big Data Analytics


Big data analytics involves collecting, processing, cleaning, and analyzing enormous
datasets to improve organizational operations. The working process of big data analytics
includes the following steps –

Step 1. Gather data


Each company has a unique approach to data collection. Organizations can now collect
structured and unstructured data from various sources, including cloud storage, mobile apps,
and IoT sensors.

Step 2. Process Data


Once data is collected and stored, it must be processed properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Various processing options
are available:
● Batch processing, which looks at large blocks of data over time.
● Stream processing, which looks at small batches of data at once, shortening the
delay between collection and analysis for quicker decision-making.
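As a rough illustration of the difference (not a full streaming system), the Python sketch below computes an average both ways; the records() generator is a hypothetical source standing in for a feed of sales amounts.

```python
# Sketch: batch vs. stream processing of the same data.
# records() is a hypothetical data source yielding sales amounts.
from statistics import mean

def records():
    yield from [120.0, 80.5, 95.0, 210.0, 60.0, 150.0]

# Batch processing: accumulate a large block of data, then analyze it in one pass.
batch = list(records())
print("batch average:", mean(batch))

# Stream processing: analyze each record as it arrives, so insights are
# available before the full dataset exists.
total, count = 0.0, 0
for amount in records():
    total += amount
    count += 1
    print(f"running average after {count} records: {total / count:.2f}")
```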

Step 3. Clean Data


Scrubbing all data, regardless of size, improves quality and yields better results. Correct
formatting and elimination of duplicate or irrelevant data are essential. Erroneous and
missing data can lead to inaccurate insights.
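The pandas sketch below shows what this scrubbing can look like in practice; the miniature table and its column names are invented for illustration.

```python
# Sketch: basic data cleaning with pandas on a tiny, made-up dataset.
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "delhi ", "Mumbai", "Mumbai", None],
    "sales": [250, 250, 310, 310, 180],
})

df["city"] = df["city"].str.strip().str.title()  # fix inconsistent formatting
df = df.drop_duplicates()                        # remove duplicate records
df = df.dropna(subset=["city"])                  # drop rows missing key fields
print(df)
```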

Step 4. Analyze Data


Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes
can turn big data into big insights.

Example: Data Analytics Tools – Tableau, Apache Hadoop, Cassandra, MongoDB, SAS

Using Orange Data Mining for Big Data Analytics

We will explore how big data analysis can be performed using Orange Data Mining.

Step 1: Gather Data


1. Use the File widget to load data into Orange.
2. Load the desired dataset. For demonstration, we will use the built-in Heart Disease
dataset.

It is important to carefully study the dataset and understand the features and target
variable.

● Features: age, gender, chest pain, resting blood pressure (rest_sbp), cholesterol,
resting ECG (rest_ecg), maximum heart rate (max_hr), etc.
● Target: diameter narrowing.

If the value for diameter narrowing is 1, it signifies significant narrowing of the arteries,
which is a risk factor for heart disease. If the value is 0, it indicates healthier arteries
with minimal or no narrowing.

Step 2: Process Data
Data processing involves preparing the data for accurate analysis. There are two methods:

1. Batch Processing: Use the Preprocess widget to normalize large chunks of
structured data at once.
2. Stream Processing (near-real-time): While Orange does not natively support live
stream data, you can incrementally process smaller subsets of the data in parallel
workflows.
Here, we will focus on the Normalization technique.
Normalization in data preprocessing refers to scaling numerical values to a specific
range (e.g., 0 to 1 or −1 to 1), making them comparable and improving the performance
of machine learning algorithms.
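As a minimal sketch of the formula behind this kind of min-max scaling (the cholesterol values below are illustrative, not taken from the dataset):

```python
# Sketch: min-max normalization, mapping values onto a chosen interval.
def normalize(values, low=0.0, high=1.0):
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:                       # all values identical: nothing to scale
        return [low] * len(values)
    # map each value from [vmin, vmax] onto [low, high]
    return [low + (v - vmin) / span * (high - low) for v in values]

cholesterol = [233, 286, 199, 354, 250]     # illustrative values only
print(normalize(cholesterol))               # all results lie between 0 and 1
print(normalize(cholesterol, -1, 1))        # or between -1 and 1
```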

Step 2.1: Normalize Data


1. Connect the Preprocess widget to the File or Data Table widget.
2. Double-click on the Preprocess widget and select "Normalize Features".
3. Choose an interval, such as 0 to 1 or −1 to 1.

Step 2.2: Verify Normalized Data


1. Connect the Data Table widget to the Preprocess widget.
2. Open the Data Table to observe the differences in values.

You will see that all numerical values are now scaled between 0 and 1.

Step 3: Clean Data
Data cleaning is essential to ensure quality results. We will use the Impute widget to
handle missing values by replacing them with the mean, median, mode, or a custom
value. In this dataset, some values are missing, as shown in the figure below. This
dataset with missing values is saved as heart data.xlsx in the computer folder.

Step 3.1: Upload Data


1. Use the File widget to upload a dataset with missing values.
2. Assign the role of "Target" to the feature you want to predict.

Step 3.2: Handle Missing Values


1. Connect the Impute widget to the File widget.
2. Double-click the Impute widget and select an imputation strategy:
Average (mean), Most frequent (mode), Fixed value, Random value

Step 3.3: Verify Cleaned Data
1. Connect the Data Table widget to the Impute widget.
2. Open the Data Table to confirm the missing values have been replaced.

Missing values are now filled with the chosen method (e.g., average values).
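For comparison, here is roughly what the Impute widget does, expressed in pandas; the two-column table and its values are made up for the example.

```python
# Sketch: filling missing values, mirroring two of the Impute strategies.
import pandas as pd

df = pd.DataFrame({
    "age": [63, 41, None, 56],
    "cholesterol": [233, None, 199, 286],
})

df["age"] = df["age"].fillna(df["age"].mean())      # average (mean)
df["cholesterol"] = df["cholesterol"].fillna(200)   # fixed value
print(df)  # no missing values remain
```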

Step 4: Analyze Data


After cleaning, Orange provides various advanced analytics tools to extract insights:

● K-Means: For segmenting data into clusters.
● Logistic Regression / Decision Tree: For predicting outcomes using labeled data.
● Scatter Plot / Box Plot / Heat Map: For visualizing data patterns and relationships.
Step 4.1: Build a Logistic Regression Model
1. Drag and drop the Logistic Regression widget.
2. Connect it to the cleaned and normalized data.

Step 4.2: Test the Model


1. Add the Test and Score widget.
2. Connect the Test and Score widget to:
a. The Logistic Regression widget (learner data)
b. The processed data.

Step 4.3: Choose a Validation Method
1. Double-click the Test and Score widget. Select a validation method (e.g.,
Cross-Validation).

Step 4.4: Generate Predictions


Connect the Predict widget to the Test and Score widget.

Check the predictions generated using the Logistic Regression model.
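If you prefer code to widgets, the scikit-learn sketch below mirrors the same workflow: normalize the features, fit a logistic regression model, and score it with cross-validation. It uses scikit-learn's built-in breast cancer dataset as a stand-in, since the Heart Disease file may not be available outside Orange.

```python
# Sketch: normalize -> logistic regression -> cross-validation in scikit-learn.
# The built-in breast cancer dataset stands in for the Heart Disease data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline plays the role of the Preprocess + Logistic Regression widgets.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation, analogous to choosing Cross-Validation in Test and Score.
scores = cross_val_score(model, X, y, cv=5)
print("accuracy per fold:", scores.round(3))
```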

5.7. Mining Data Streams

To understand mining data streams, let us first understand what a data stream is. A data
stream is a continuous, real-time flow of data generated by various sources. These sources
can include sensors, satellite image data, Internet and web traffic, etc.
Mining data streams refers to the process of extracting meaningful patterns, trends,
and knowledge from a continuous flow of real-time data. Unlike traditional data mining, it
processes data as it arrives, without storing it completely. An example of an area where
data stream mining can be applied is website data. Websites typically receive continuous
streams of data daily. For instance, a sudden spike in searches for "election results" on a
particular day might indicate that elections were recently held in a region or highlight the
level of public interest in the results.
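A minimal sketch of this idea in Python, assuming a made-up feed of daily query counts: keep only a small sliding window of recent counts and flag a sudden spike as it arrives, without ever storing the full history.

```python
# Sketch: detecting a spike in a data stream with a small sliding window.
from collections import deque

WINDOW = 5          # how many recent counts to remember
SPIKE_FACTOR = 3.0  # flag counts this many times above the recent average

recent = deque(maxlen=WINDOW)
# Hypothetical daily counts of searches for "election results".
query_counts = [40, 38, 45, 42, 41, 300, 310, 50]

for count in query_counts:
    if len(recent) == WINDOW:
        avg = sum(recent) / WINDOW
        if count > SPIKE_FACTOR * avg:
            print(f"spike detected: {count} vs recent average {avg:.0f}")
    recent.append(count)  # old counts fall out; full history is never stored
```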

5.8. Future of Big Data Analytics

The future of Big Data Analytics is highly influenced by several key technological
advancements that will shape the way data is processed and analyzed. A few of them are:

Real-Time Analytics: It will allow businesses to process data instantaneously, providing
immediate insights for decision-making and enabling actions based on live data, such as
monitoring customer behavior or tracking supply chain activities.

Development of Advanced Models in Predictive Analytics: Predictive analytics will
evolve with the integration of more sophisticated machine learning and AI algorithms,
enabling organizations to forecast trends and behaviors with greater precision.

Quantum Computing: Quantum computing promises to revolutionize Big Data analytics
by offering unprecedented processing power. Quantum computers will be able to solve
complex problems much faster than classical computers.

-----------------------------------------------------------------------------------------------------

Activity 1: Note – This is a research-based group activity


i) Watch this video using the link https://www.youtube.com/watch?v=37x5dKW-X5U
ii) Form a group, explore the applications of Big Data & Data Analytics in the following
fields, and fill in the table given below:

| Field | Video resource | Insights drawn about this field and its futuristic development |
| --- | --- | --- |
| Education | | |
| Environmental Science | | |
| Media and Entertainment | | |

Activity-2
List the steps involved in the working process of Big Data analytics.
Step 1:

Step 2:

Step 3:

Step 4:

EXERCISES

A. Multiple Choice questions


1. What does "Volume" refer to in the context of big data?
a) The variety of data types b) The speed at which data is generated
c) The amount of data generated d) The veracity of the data

2. Which of the following is a key characteristic of big data?


a) Structured format b) Easily manageable size
c) Predictable patterns d) Variety

3. Which of the following is NOT one of the V's of big data?


a) Velocity b) Volume c) Verification d) Variety

4. What is the primary purpose of data preprocessing in big data analytics?


a) To increase data volume b) To reduce data variety
c) To improve data quality d) To speed up data processing

5. Which technique is commonly used for analyzing large datasets to discover patterns
and relationships?
a) Linear regression b) Data mining c) Decision trees d) Naive Bayes

6. Which term describes the process of extracting useful information from large
datasets?
a) Data analytics b) Data warehousing c) Data integration d) Data virtualization

7. Which of the following is a potential benefit of big data analytics?


a) Decreased data security b) Reduced operational efficiency
c) Improved decision-making d) Reduced data privacy

8. What role does Hadoop play in big data processing?
a) Hadoop is a programming language used for big data analytics.
b) Hadoop is a distributed file system for storing and processing big data.
c) Hadoop is a data visualization tool.
d) Hadoop is a NoSQL database management system.

9. What is the primary challenge associated with the veracity aspect of big data?
a) Handling large volumes of data
b) Ensuring data quality and reliability
c) Dealing with diverse data types
d) Managing data processing speed

B. True or False

1. Big data refers to datasets that are too large to be processed by traditional
database systems.
2. Structured data is the primary type of data processed in big data analytics, making
up the majority of datasets.
3. Veracity refers to the trustworthiness and reliability of data in big data analytics
4. Real-time analytics involves processing and analyzing data as it is generated, without
any delay.
5. Cloud computing is the only concept used in Big Data Analytics.
6. A CSV file is an example of structured data.
7. “Positive, Negative, and Neutral” are terms related to Sentiment Analysis.
8. Data preprocessing is a critical step in big data analytics, involving cleaning,
transforming, and aggregating data to prepare it for analysis.
9. To analyze vast collections of textual materials to capture key concepts, trends, and
hidden relationships, the concept of Text mining is used.

C. Short answer questions


1. Define the term Big Data.
2. What does the term Volume refer to in Big Data?
3. Mention some important benefits of big data in the health sector.
4. Enlist the four types of Big Data Analytics.

D. Long answer questions
1. Explain the 6 V’s related to Big data.
2. Explain the differences between structured, semi-structured, and unstructured data.
3. Explain the process of Big Data Analytics.
4. Why is Big Data Analytics important in modern industries and decision-making
processes?
5. A healthcare company is using Big Data analytics to manage patient records, predict
disease outbreaks, and personalize treatments. However, the company is facing
challenges regarding data privacy, as patient information is highly sensitive. What
are the potential risks to patient privacy when using Big Data in healthcare, and how
can these be mitigated?
6. Given the following list of data types, categorize each as Structured, Unstructured,
or Semi-Structured:
a) A customer database with fields such as Name, Address, Phone Number, and
Email.
b) A JSON file containing product information with attributes like name, price, and
specifications.
c) Audio recordings of customer service calls.
d) A sales report in Excel format with rows and columns.
e) A collection of social media posts, including text, images, and hashtags.
f) A CSV file with daily temperature readings for the past year.

E. Competency Based Questions:


1. A retail clothing store is experiencing a decline in sales despite strong marketing
campaigns. You are tasked with using big data analytics to identify the root cause.
a. What types of customer data can be analyzed?
b. How can big data analytics be used to identify buying trends and customer
preferences?
c. Can you recommend specific data visualization techniques to present insights
to stakeholders?
d. How might these insights be used to personalize customer experiences and
improve sales?
2. A research institute is conducting a study on public sentiment towards environmental
conservation efforts. They aim to gather insights from various data sources to
understand public opinions and perceptions. They collect data from diverse sources
such as news articles, online forums, blog posts, and social media comments. Which
type of data does this description represent?
