Activity-2
List the steps involved in the working process of Big Data analytics.
Step 1:
Step 2:
Step 3:
Step 4:
Solution:
Step 1 - Gather data
Step 2 - Process data
Step 3 - Clean data
Step 4 - Analyse data
EXERCISES
A. Multiple Choice Questions
1. What does "Volume" refer to in the context of big data?
a) The variety of data types b) The speed at which data is generated
c) The amount of data generated d) The veracity of the data
2. Which of the following is a key characteristic of big data?
a) Structured format b) Easily manageable size
c) Predictable patterns d) Variety
3. Which of the following is NOT one of the V's of big data?
a) Velocity b) Volume c) Verification d) Variety
4. What is the primary purpose of data preprocessing in big data analytics?
a) To increase data volume b) To reduce data variety
c) To improve data quality d) To speed up data processing
5. Which technique is commonly used for analyzing large datasets to discover patterns
and relationships?
a) Linear regression b) Data mining c) Decision trees d) Naive Bayes
6. Which term describes the process of extracting useful information from large
datasets?
a) Data analytics b) Data warehousing c) Data integration d) Data virtualization
7. Which of the following is a potential benefit of big data analytics?
a) Decreased data security b) Reduced operational efficiency
c) Improved decision-making d) Reduced data privacy
8. What role does Hadoop play in big data processing?
a) Hadoop is a programming language used for big data analytics.
b) Hadoop is a distributed file system for storing and processing big data.
c) Hadoop is a data visualization tool.
d) Hadoop is a NoSQL database management system.
9. What is the primary challenge associated with the veracity aspect of big data?
a) Handling large volumes of data
b) Ensuring data quality and reliability
c) Dealing with diverse data types
d) Managing data processing speed
B. True or False
1. Big data refers to datasets that are too large to be processed by traditional
database systems. (True)
2. Structured data is the primary type of data processed in big data analytics, making
up the majority of datasets. (False)
3. Veracity refers to the trustworthiness and reliability of data in big data analytics.
(True)
4. Real-time analytics involves processing and analyzing data as it is generated, without
any delay. (True)
5. Cloud computing is the only concept used in Big Data Analytics. (False)
6. A CSV file is an example of structured data. (True)
7. “Positive, Negative, and Neutral” are terms related to Sentiment Analysis. (True)
8. Data preprocessing is a critical step in big data analytics, involving cleaning,
transforming, and aggregating data to prepare it for analysis. (True)
9. To analyze vast collections of textual materials to capture key concepts, trends, and
hidden relationships, the concept of Text mining is used. (True)
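Item 7 above refers to Sentiment Analysis. A minimal rule-based sketch in Python is shown below; the word lists are purely illustrative, not a real sentiment lexicon (real systems use trained models or large lexicons):

```python
import re

# Tiny illustrative word lists (an assumption for this sketch,
# not a real lexicon).
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "sad"}

def sentiment(text: str) -> str:
    """Classify text as Positive, Negative, or Neutral by word counts."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(sentiment("The service was great"))            # Positive
print(sentiment("A terrible experience"))            # Negative
print(sentiment("The parcel arrived on Tuesday"))    # Neutral
```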
C. Short answer questions
1. Define the term Big Data.
Ans - Big Data refers to a vast collection of data that is characterized by its immense
volume, which continues to expand rapidly over time.
2. What does the term Volume refer to in Big Data?
Ans - Volume refers to the quantity of data to be stored. In the case of big data, a huge
quantity of data is generated in a very short period of time. For example, Walmart handles
more than 1 million customer transactions every hour, importing more than 2.5 petabytes
of data into its databases.
3. Mention some important benefits of big data in the health sector.
Ans -
● Predictive analysis for predicting disease outbreaks, patient analysis, and other
health risks
● Personalized medicine
● Clinical decision support
● Healthcare resource management
4. Enlist the four types of Big Data Analytics.
Ans - The four types of Big Data Analytics are:
1. Descriptive Analytics: Summarizes historical data to identify patterns and trends.
2. Diagnostic Analytics: Analyses past data to understand the reasons behind specific
outcomes.
3. Predictive Analytics: Uses historical data to forecast future events or trends.
4. Prescriptive Analytics: Recommends actions to achieve desired outcomes based on
data insights.
These types are designed to provide insights at different levels of decision-making and
problem-solving.
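As an illustration, the first and third types can be sketched in a few lines of Python on a small, made-up monthly sales series: descriptive analytics summarizes the history, while predictive analytics fits a least-squares trend line to forecast the next month (the numbers are invented for this sketch):

```python
from statistics import mean

sales = [100, 110, 120, 130, 140, 150]  # six months of sales (made-up data)

# Descriptive analytics: summarize historical data.
print("average monthly sales:", mean(sales))

# Predictive analytics: fit a straight-line trend (least squares)
# and forecast the next month.
n = len(sales)
xs = range(n)
x_bar, y_bar = mean(xs), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales)) \
        / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar
forecast = intercept + slope * n
print("forecast for month 7:", forecast)
```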
D. Long answer questions
1. Explain the 6 V’s related to Big data.
The 6 V's of Big Data are:
I. Volume: Refers to the massive amount of data generated daily, ranging from
terabytes to exabytes. For example, 328.77 million terabytes of data are created
every day.
II. Velocity: Describes the speed at which data is generated, delivered, and
analyzed. For instance, Google processes over 40,000 search queries per second.
III. Variety: Indicates the different forms of data, including structured (e.g.,
databases), semi-structured (e.g., XML files), and unstructured data (e.g., videos,
images, and social media posts).
IV. Veracity: Focuses on the accuracy, quality, and trustworthiness of data. It
involves ensuring that data is reliable and suitable for analytical models by
addressing inconsistencies or inaccuracies.
V. Value: Refers to the insights and business benefits that can be extracted from Big
Data. Without deriving value, the other characteristics hold little significance.
VI. Variability: Highlights the inconsistencies or unpredictability in data flow,
requiring systems to adapt and extract meaningful insights even in dynamic
conditions.
2. Explain the differences between structured, semi-structured, and unstructured data.
Ans - Structured data follows a fixed schema of rows and columns and is easily stored in
relational databases (e.g., a customer table or a spreadsheet). Semi-structured data does
not fit a rigid tabular schema but contains organizational markers such as tags or
key-value pairs (e.g., XML and JSON files). Unstructured data has no predefined format
or organization (e.g., videos, images, audio recordings, and social media posts).
(Teachers can add a few more points if required.)
3. Explain the process of Big Data Analytics.
Ans - The process of Big Data Analytics can be divided broadly into four major
steps. They are as follows:
Step 1. Gather data
Each company has a unique approach to data collection. Organizations can now
collect structured and unstructured data from various sources, including cloud
storage, mobile apps, and IoT sensors.
Step 2. Process Data
Once data is collected and stored, it must be processed properly to get accurate
results from analytical queries, especially when the data is large and unstructured.
Processing can be done in batches (batch processing) or continuously as the data
arrives (stream processing).
Step 3. Clean Data
Scrubbing all data, regardless of size, improves quality and yields better results.
Correct formatting and elimination of duplicate or irrelevant data are essential. Dirty
data can lead to inaccurate insights.
Step 4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics
processes can turn big data into big insights.
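The four steps above can be sketched on a tiny in-memory dataset using only Python's standard library; the records and field names below are invented for illustration, and real pipelines would pull from files, apps, or sensors:

```python
from statistics import mean

# Step 1 - Gather: collect raw records (hard-coded here; in practice
# they come from cloud storage, mobile apps, IoT sensors, etc.).
raw = [
    {"id": 1, "amount": "250"},
    {"id": 2, "amount": "300"},
    {"id": 2, "amount": "300"},   # duplicate record
    {"id": 3, "amount": ""},      # missing value
]

# Step 2 - Process: convert fields into usable types, dropping
# records whose amount cannot be parsed.
processed = [dict(r, amount=int(r["amount"])) for r in raw if r["amount"].isdigit()]

# Step 3 - Clean: remove duplicate records by id.
seen, clean = set(), []
for r in processed:
    if r["id"] not in seen:
        seen.add(r["id"])
        clean.append(r)

# Step 4 - Analyze: compute a simple insight from the cleaned data.
print("average amount:", mean(r["amount"] for r in clean))
```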
4. Why is Big Data Analytics important in modern industries and decision-making
processes?
Big Data Analytics is important in modern industries and decision-making processes
because it:
1. Enables Data-Driven Decisions: By analyzing vast and diverse datasets,
organizations can make informed decisions based on insights and trends.
2. Improves Efficiency and Productivity: Identifying inefficiencies and optimizing
resource allocation helps streamline processes.
3. Enhances Customer Insights: Understanding customer behavior and preferences
enables personalized marketing and improved customer experiences.
4. Provides Competitive Advantage: Leveraging analytics helps organizations
uncover market trends, identify opportunities, and stay ahead of competitors.
5. Fosters Innovation and Growth: Insights derived from data analysis drive the
development of new products, services, and business models.
5. A healthcare company is using Big Data analytics to manage patient records, predict
disease outbreaks, and personalize treatments. However, the company is facing
challenges regarding data privacy, as patient information is highly sensitive. What are the
potential risks to patient privacy when using Big Data in healthcare, and how can these be
mitigated?
Potential Risks to Patient Privacy:
1. Unauthorized Access: Sensitive patient information could be accessed by
unauthorized individuals, leading to breaches of confidentiality.
2. Data Breaches: Cyberattacks could expose patient data to malicious actors.
3. Misuse of Personal Information: Patient data might be used for purposes beyond
its intended scope, such as marketing or profiling.
4. Regulatory Non-Compliance: Failing to comply with data protection laws like
GDPR or the Digital Personal Data Protection Act, 2023, could lead to legal and
financial penalties.
Mitigation Strategies:
1. Data Encryption: Encrypt data during storage and transmission to protect against
unauthorized access.
2. Access Controls: Implement strict access controls to ensure that only authorized
personnel can access sensitive data.
3. Anonymization: Remove personally identifiable information (PII) from datasets to
safeguard patient identity during analysis.
4. Regular Audits: Conduct regular security audits to identify and address
vulnerabilities.
5. Compliance with Regulations: Adhere to data protection laws to ensure ethical
handling of sensitive information.
6. Employee Training: Educate staff about data privacy practices and the importance
of protecting patient information.
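As an illustration of strategies 1-3, the sketch below drops direct PII fields and replaces the patient identifier with a one-way hash. The field names are invented for this example, and note that hashing an identifier is pseudonymization, which is weaker than full anonymization:

```python
import hashlib

def pseudonymize(record, pii_fields=("name", "phone")):
    """Drop direct PII fields and replace the patient id with a one-way hash."""
    out = {k: v for k, v in record.items() if k not in pii_fields}
    out["patient_id"] = hashlib.sha256(
        str(record["patient_id"]).encode()
    ).hexdigest()[:12]
    return out

# Invented sample record for illustration.
patient = {"patient_id": 1042, "name": "A. Kumar", "phone": "9812345678",
           "diagnosis": "flu"}
safe = pseudonymize(patient)
print(safe)  # diagnosis kept; name and phone removed; id replaced by a hash
```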
6. Given the following list of data types, categorize each as Structured, Unstructured, or
Semi-Structured:
a) A customer database with fields such as Name, Address, Phone Number, and
Email. Structured
b) A JSON file containing product information with attributes like name, price, and
specifications. Semi-Structured
c) Audio recordings of customer service calls. Unstructured
d) A sales report in Excel format with rows and columns. Structured
e) A collection of social media posts, including text, images, and hashtags.
Unstructured
f) A CSV file with daily temperature readings for the past year. Structured
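The difference between (b) and (f) can be seen by parsing each with Python's standard library: the CSV has a fixed schema where every row carries the same fields, while the JSON nests attributes under tags rather than rows and columns (the sample data below is made up):

```python
import csv
import io
import json

# Structured: a CSV with a fixed schema - every row has the same fields.
csv_text = "date,temp_c\n2024-01-01,21.5\n2024-01-02,19.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])  # uniform rows and columns

# Semi-structured: JSON carries its own tags and may nest attributes.
json_text = ('{"name": "Shirt", "price": 499, '
             '"specifications": {"size": "M", "colour": "blue"}}')
product = json.loads(json_text)
print(product["specifications"]["size"])  # tag-based, nested structure
```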
E. Competency-Based Questions:
1. A retail clothing store is experiencing a decline in sales despite strong marketing
campaigns. You are tasked with using big data analytics to identify the root cause.
a. What types of customer data can be analyzed?
b. How can big data analytics be used to identify buying trends and customer
preferences?
c. Can you recommend specific data visualization techniques to present insights
to stakeholders?
d. How might these insights be used to personalize customer experiences and
improve sales?
Ans:
a. Analyze the customer's purchase history (items bought together, frequency,
time of purchase), demographics (age, location, income), and browsing
behavior (clicks, time spent on product pages).
b. Big data analytics can help to
i. identify items that are frequently purchased together, to optimize
product placement and promotions;
ii. group customers based on demographics and buying habits;
iii. track customer journeys on the website and identify areas for
improvement (e.g., the checkout process).
c. Use bar or pie charts for key metrics (sales by category, customer
demographics) for easy stakeholder comprehension, and heat maps to show
customer browsing behavior on the website (hotspots indicating items of
interest).
d. These insights can help the store to
i. recommend relevant products based on a customer's purchase history
and browsing behavior;
ii. tailor promotions and advertisements to specific customer segments;
iii. adjust prices based on demand and customer demographics.
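Point b(i), finding items that are frequently purchased together, can be sketched with simple pairwise co-occurrence counting, a toy form of market-basket analysis (the transactions below are invented for illustration):

```python
from collections import Counter
from itertools import combinations

# Made-up transactions: each set is one customer's basket.
transactions = [
    {"jeans", "t-shirt", "belt"},
    {"jeans", "t-shirt"},
    {"jeans", "belt"},
    {"t-shirt", "socks"},
]

# Count how often each pair of items appears in the same basket.
pairs = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pairs[pair] += 1

print(pairs.most_common(2))  # the most frequently co-purchased pairs
```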
2. A research institute is conducting a study on public sentiment towards environmental
conservation efforts. They aim to gather insights from various data sources to
understand public opinions and perceptions. They collect data from diverse sources
such as news articles, online forums, blog posts, and social media comments. Which
type of data does this description represent?
Ans: Unstructured data
3. A global e-commerce platform is experiencing rapid growth in its user base, with
millions of transactions occurring daily across various product categories. As part of
their data analytics efforts, they are focused on improving the speed and efficiency
of processing incoming data to provide real-time recommendations to users during
their browsing and purchasing journeys. Identify the specific characteristic of big
data (6V's of Big Data) that is most relevant in the above scenario and justify your
answer.
Ans:
In the scenario described, the most relevant characteristic of big data from the 6 V's
perspective is Velocity, because the scenario highlights the need for the e-commerce
platform to handle the high speed at which data is generated from millions of daily
transactions. The platform must process this data quickly to provide real-time
recommendations during a user's browsing and purchasing journey; delays in
processing could mean missed opportunities to influence customer decisions.