Module 1

Q1. Define Big Data. Explain its main characteristics with examples.
Big Data is a collection of large and complex datasets that cannot be captured, stored, managed, or
processed using traditional database management tools within a tolerable timeframe. It includes
structured, semi-structured, and unstructured data originating from heterogeneous sources such as
social media, sensors, emails, and transaction logs.
The characteristics of Big Data are commonly explained using the 5 V’s:

1. Volume – Refers to the scale of data. Organizations now deal with terabytes and petabytes of
data. For example, Facebook generates over 4 petabytes of data daily.
2. Velocity – Refers to the speed at which data is generated, collected, and processed. For
example, stock exchange data is generated within milliseconds and must be processed
instantly.
3. Variety – Refers to different formats of data: structured (tables), semi-structured (XML,
JSON), and unstructured (images, videos, emails). For example, an e-commerce site collects
customer transaction records, product reviews, and product images simultaneously.
4. Veracity – Refers to data uncertainty, inconsistencies, and trustworthiness. For instance, data
from social media may contain spam or misleading content.
5. Value – Refers to the ability of Big Data to provide useful insights for decision-making. For
example, Amazon uses customer browsing and purchase data to provide product
recommendations, increasing business profitability.

Big Data is therefore not only about managing large data volumes but also about analyzing diverse,
fast-moving, and complex data to create business value.

Q2. Discuss the evolution of Big Data and how it has transformed data
management.
The concept of Big Data evolved due to limitations in traditional systems when dealing with the
exponential growth of data. Its evolution can be traced through four major stages:

• Stage 1: Traditional Databases (Pre-2000): Data was mainly structured and stored in relational databases. Organizations relied on OLAP and reporting tools for decision-making. These systems worked well for gigabytes of data but struggled beyond that scale.
• Stage 2: Internet Explosion (2000–2005): With the rise of emails, online transactions, and websites, unstructured and semi-structured data emerged. Traditional systems became insufficient to handle the sudden growth.
• Stage 3: Big Data Technologies (2005 onwards): Google introduced MapReduce for distributed processing and Yahoo developed Hadoop to manage massive unstructured data sets using HDFS. This was the beginning of open-source Big Data tools.
• Stage 4: Current Scenario: Cloud platforms like AWS, Azure, and Google Cloud now offer scalable storage and real-time analytics. AI and Machine Learning are increasingly integrated into Big Data platforms for predictive analysis.

This evolution transformed data management by shifting from centralized RDBMS to distributed file
systems, from batch-only processing to real-time stream analytics, and from expensive proprietary
servers to cost-effective commodity clusters. As a result, organizations can now analyze diverse data
types at scale and make decisions in real time.



Q3. Differentiate between structured, semi-structured, and unstructured data
with examples.
Big Data consists of three major categories of data:

• Structured Data: Data that is organized into fixed schemas such as rows and columns. It is easy to query using SQL. Example: Banking transactions, student records, and airline bookings.
• Semi-Structured Data: Data that does not strictly follow tabular form but has markers or tags that provide structure. Example: JSON, XML, and emails with headers and metadata.
• Unstructured Data: Data without a predefined schema, which cannot be stored easily in an RDBMS. Specialized tools such as Hadoop, Spark, or NoSQL databases are used for storage and analysis. Example: Images, videos, free-form text, social media posts, and satellite images.

Comparison:
Feature | Structured Data | Semi-Structured Data | Unstructured Data
Format | Tables (rows & columns) | Tags/metadata-based | Free-form, irregular
Storage | RDBMS | NoSQL, Hadoop | Hadoop, Cloud
Example | Payroll records | JSON, XML, sensor logs | Videos, tweets, images

Organizations deal with all three types of data, making Big Data technologies critical to managing
and analyzing these diverse formats effectively.
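
The practical difference between the three categories also shows up in how they are handled in code. The Python sketch below is purely illustrative (the transaction table, product review, and social media post are invented examples): the structured record fits a fixed SQL schema, the semi-structured record carries its own keys and tags, and the unstructured text must be parsed before it yields anything measurable.

```python
import json
import sqlite3

# Structured: fixed schema, easy to query with SQL (in-memory table for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, customer TEXT, amount REAL)")
conn.execute("INSERT INTO transactions VALUES (1, 'Asha', 2500.0)")
total = conn.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]

# Semi-structured: JSON carries its own tags/keys instead of a fixed table schema
review = json.loads('{"product": "P101", "rating": 4, "tags": ["fast", "value"]}')

# Unstructured: free-form text has no schema; even a word count needs custom parsing
post = "Loved the delivery speed, but the packaging could be better."
word_count = len(post.split())

print(total, review["rating"], word_count)
```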

Q4. Explain the importance of Big Data in modern business decision-making.


Big Data has become vital for organizations because of its ability to transform raw information into
meaningful business insights. Some key areas where it impacts decision-making include:

1. Customer Insights: Companies analyze purchasing behavior, browsing history, and social
media activity to understand customer preferences. Netflix, for example, uses Big Data to
recommend shows and movies tailored to each user.
2. Operational Efficiency: Organizations optimize operations by analyzing machine data, logs,
and supply chain records. Airlines use Big Data to improve flight scheduling and reduce fuel
costs.
3. Fraud Detection and Security: Financial institutions track real-time transactions to detect
anomalies. Credit card companies, for example, immediately flag suspicious purchases.
4. Risk Management: Predictive analytics is used to foresee risks and manage uncertainties.
Insurance companies calculate premiums based on Big Data analysis of past claims.
5. Innovation and Product Development: Market data is used to identify trends, leading to the
creation of innovative products and services. Smartphone companies launch new features after
studying user feedback and usage data.
6. Real-time Decision Making: Big Data enables instant analysis, allowing businesses to act
immediately. E-commerce platforms adjust prices dynamically during peak sales seasons.

Through these applications, Big Data ensures that businesses remain competitive by supporting
evidence-based and timely decisions across industries such as retail, healthcare, finance, and
manufacturing.



Q5. Compare Traditional Business Intelligence and Big Data approaches.
Traditional Business Intelligence (BI) was designed to analyze structured data from enterprise
systems and provide historical reports. Big Data, in contrast, is designed to handle huge volumes of
structured, semi-structured, and unstructured data for real-time, predictive, and prescriptive insights.
Comparison:
Aspect | Traditional BI | Big Data Analytics
Data Type | Structured (rows and columns) | Structured, semi-structured, unstructured
Volume | Handles GB–TB of data | Handles TB–PB–ZB of data
Processing | Batch processing, periodic reports | Batch + real-time processing
Storage | RDBMS, data warehouses | Hadoop Distributed File System, NoSQL
Scalability | Limited, vertical scaling | Highly scalable, distributed clusters
Tools | SQL, OLAP, data warehouses | Hadoop, Spark, Hive, Pig, HBase
Focus | Historical, descriptive analytics | Predictive and prescriptive analytics

For example, in retail, traditional BI can generate end-of-month sales reports, whereas Big Data can
analyze clickstream and social media data in real time to recommend products instantly. This shows
that BI is better for historical, structured data analysis, while Big Data analytics offers advanced
capabilities suitable for modern, data-driven enterprises.

Q6. Draw and explain the architecture of a typical data warehouse environment.
A data warehouse is a central repository that stores historical data from multiple sources for reporting
and decision-making. The architecture typically consists of the following components:
1. Data Sources:

• Includes transactional databases, ERP systems, CRM systems, and external sources.
• Data can be structured or semi-structured.

2. ETL (Extract, Transform, Load):

• Extracts data from sources, cleans and transforms it into a common format, and loads it into the warehouse.
• Tools like Informatica, Talend, or SQL-based ETL are used.

3. Data Warehouse Database:

• Central storage where integrated data is kept.
• Uses RDBMS or OLAP-based systems.
• Organized into fact tables (measurable data) and dimension tables (descriptive attributes).

4. Metadata Repository:

• Stores definitions, mappings, and rules for ETL and queries.
• Helps maintain data consistency.

5. Front-end/Access Tools:

• Provides interfaces for querying, reporting, dashboards, and analytics.
• Tools: Business Objects, Cognos, Power BI.

Typical Architecture Diagram (textbook):

• Data Sources → ETL → Data Warehouse → Metadata → OLAP/Reporting Tools.

A typical data warehouse supports business intelligence by integrating data into a single platform,
enabling organizations to generate reports, perform trend analysis, and support decision-making based
on historical information.
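
A toy version of this flow, from extraction through loading to a simple report, is sketched below in Python. SQLite stands in for the warehouse database, and the sales rows, table name, and currency conversion are invented for illustration; production environments would use dedicated ETL tools such as those named above.

```python
import sqlite3

# Extract: rows pulled from a hypothetical operational source system
source_rows = [
    ("2024-01-15", "Store-01", "INR", 125000.0),
    ("2024-01-15", "Store-02", "USD", 1800.0),
]

# Transform: clean and convert into a common format (everything in INR here)
USD_TO_INR = 83.0   # illustrative fixed rate
transformed = [
    (date, store, amount * USD_TO_INR if currency == "USD" else amount)
    for date, store, currency, amount in source_rows
]

# Load: write into a fact table inside the warehouse (SQLite stands in for it)
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS sales_fact (sale_date TEXT, store TEXT, amount_inr REAL)"
)
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", transformed)
warehouse.commit()

# Reporting layer: a simple aggregate query over the fact table
for row in warehouse.execute("SELECT store, SUM(amount_inr) FROM sales_fact GROUP BY store"):
    print(row)
```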

Q7. Illustrate the architecture of a Hadoop ecosystem and its major components.
Hadoop is an open-source Big Data framework that allows distributed storage and parallel processing
of large datasets. Its architecture consists of:
1. Hadoop Distributed File System (HDFS):

• Storage layer of Hadoop.
• Stores data across clusters in blocks (default 128 MB).
• Provides fault tolerance through replication (default replication factor = 3).

2. MapReduce:

• Processing engine of Hadoop.
• Divides tasks into “Map” (splitting and processing data) and “Reduce” (aggregating results).

3. YARN (Yet Another Resource Negotiator):

• Resource management layer.
• Allocates resources to applications and manages scheduling.

4. Hadoop Common:

• Utilities and libraries required by other Hadoop components.

Ecosystem Components:

• Hive: Data warehousing on Hadoop using SQL-like queries.
• Pig: High-level scripting language for data analysis.
• HBase: NoSQL database on top of HDFS.
• Sqoop: For transferring data between RDBMS and Hadoop.
• Flume: Collects and loads streaming data into HDFS.

Architecture Diagram (from book):

• HDFS (storage) + MapReduce/YARN (processing) + Ecosystem tools.



This architecture enables scalable, fault-tolerant storage and efficient parallel data processing, making
Hadoop the backbone of Big Data analytics.

Q8. Discuss the limitations of traditional data warehouses in handling Big Data.
Although data warehouses are effective for structured and historical analysis, they face challenges in
the Big Data era:

1. Data Variety:

• Data warehouses mainly handle structured data in tabular format.
• They cannot efficiently manage semi-structured (JSON/XML) or unstructured (video, audio) data.

2. Scalability:

• Vertical scaling (adding more resources to a single machine) is costly and limited.
• Warehouses cannot handle petabyte-scale data efficiently.

3. Processing Speed:

• Designed for batch reporting and historical analysis.
• Not suitable for real-time or near real-time analytics.

4. Cost:

• Proprietary software and hardware make warehouses expensive to scale.

5. Flexibility:

• Schema rigidity makes it difficult to adapt to fast-changing data sources.
• Every new data type requires redesigning ETL processes.

6. Integration Issues:

• Modern data comes from diverse sources like IoT, social media, and logs, which are difficult to integrate into a warehouse.

Thus, while traditional warehouses are useful for historical reporting and structured data analysis,
they fail to meet the demands of Big Data, which requires scalability, flexibility, and real-time
processing.

Q9. Explain with an example how Hadoop addresses the challenges of Big Data.
Hadoop overcomes the challenges of Big Data in the following ways:

1. Scalability:

• Hadoop runs on clusters of commodity hardware.
• New nodes can be added easily to scale horizontally.

2. Fault Tolerance:

• HDFS replicates data blocks across multiple nodes.
• If a node fails, data can still be accessed from other replicas.

3. Cost Efficiency:

• Uses inexpensive hardware and open-source software.
• Reduces the need for high-end servers.

4. Variety Support:

• Handles structured, semi-structured, and unstructured data.
• Tools like Hive, Pig, and HBase extend Hadoop for different data formats.

5. High Processing Power:

• MapReduce processes large datasets in parallel across clusters.
• Provides high throughput and efficiency.

Example (from textbook):


In e-commerce, Hadoop can store and process billions of clickstream logs generated by users.
MapReduce can analyze browsing patterns to recommend products, while Hive can generate sales
reports. This would be difficult to achieve with traditional warehouses.
Hadoop therefore addresses Big Data challenges by offering a scalable, cost-effective, and flexible
ecosystem for distributed storage and processing.

Q10. Differentiate between a typical data warehouse and a Hadoop environment in terms of data storage and processing.
Comparison (based on textbook):
Feature | Data Warehouse | Hadoop Environment
Data Type | Structured only | Structured, semi-structured, unstructured
Data Storage | Centralized storage on high-end servers | Distributed storage using HDFS
Scalability | Vertical scaling (costly, limited) | Horizontal scaling (add commodity nodes)
Processing | Batch, SQL-based OLAP | Batch + real-time using MapReduce/Spark
Cost | Expensive proprietary systems | Low cost, open source, commodity hardware
Flexibility | Schema is rigid, hard to change | Schema-on-read, flexible for diverse data
Examples of Tools | ETL, OLAP, BI tools | Hadoop, Hive, Pig, Spark, HBase

For instance, a data warehouse can generate monthly sales reports from structured ERP data, whereas
Hadoop can analyze structured sales data along with unstructured social media reviews in real time.
This makes Hadoop more suitable for modern Big Data requirements.



Q11. Define Big Data Analytics. Explain its role in decision-making.
Big Data Analytics refers to the process of examining large and varied data sets to uncover hidden
patterns, unknown correlations, market trends, customer preferences, and other useful business
information. It applies advanced analytical techniques on Big Data to support evidence-based
decisions.
Role in Decision-Making:

1. Descriptive Insights: Helps summarize historical data to understand what has happened. For
example, analyzing last year’s sales by region.
2. Diagnostic Insights: Identifies reasons behind past outcomes. For instance, determining why
customer churn rates increased in a particular quarter.
3. Predictive Insights: Uses statistical models and machine learning to forecast future outcomes.
For example, predicting customer demand for a product.
4. Prescriptive Insights: Recommends the best actions to achieve desired results. For instance,
suggesting optimal pricing strategies during peak shopping seasons.
5. Real-time Decision Support: Big Data Analytics enables organizations to respond instantly.
For example, fraud detection systems block suspicious credit card transactions as they occur.

Thus, Big Data Analytics plays a central role in improving decision-making by providing
organizations with timely, data-driven, and actionable insights that go beyond traditional business
intelligence.

Q12. Explain the classification of analytics (descriptive, diagnostic, predictive, prescriptive) with examples.
Analytics can be broadly classified into four types:

1. Descriptive Analytics:
o Focuses on summarizing past data and identifying trends.
o Example: Monthly sales reports or web traffic analysis.
2. Diagnostic Analytics:
o Explains reasons behind certain outcomes by drilling down into data.
o Techniques like data mining and correlation analysis are used.
o Example: Analyzing why sales dropped in a particular region by looking at customer
feedback and competitor activity.
3. Predictive Analytics:
o Uses statistical models, machine learning, and historical data to forecast future events.
o Example: Predicting which customers are likely to leave a telecom service provider.
4. Prescriptive Analytics:
o Provides recommendations for the best course of action based on predictive outcomes.
o Example: Suggesting personalized offers to customers predicted to churn, to retain
them.

These four types build upon each other, starting from understanding the past (descriptive) to
prescribing future actions (prescriptive), making them essential in the Big Data Analytics lifecycle.
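
The progression can be made concrete with a small Python sketch. The monthly sales figures, the linear-trend forecast, and the closing business rule below are all invented for illustration; real predictive and prescriptive analytics would rely on proper statistical or machine learning models rather than a straight-line fit.

```python
import numpy as np

# Descriptive: summarize what happened (hypothetical monthly sales figures)
sales = np.array([120, 135, 128, 150, 162, 158], dtype=float)
print("Average monthly sales:", sales.mean())

# Diagnostic: drill down to find where the largest drop occurred
changes = np.diff(sales)
worst_month = int(np.argmin(changes)) + 1
print("Largest drop happened after month", worst_month, "change:", changes.min())

# Predictive: fit a simple linear trend and forecast the next month
months = np.arange(len(sales))
slope, intercept = np.polyfit(months, sales, deg=1)
forecast = slope * len(sales) + intercept
print("Forecast for next month:", round(forecast, 1))

# Prescriptive: recommend an action based on the prediction (toy business rule)
if forecast < sales[-1]:
    print("Recommendation: run a promotional discount to counter the expected dip.")
else:
    print("Recommendation: increase stock to meet the expected demand.")
```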



Q13. Discuss the importance of Big Data Analytics in various industries (e.g.,
healthcare, retail, banking).
Big Data Analytics provides industry-specific advantages by enabling smarter, data-driven decisions:

1. Healthcare:
o Analyzes patient records, genetic data, and clinical trials to improve diagnosis and
treatment.
o Example: Predictive models forecast disease outbreaks based on population health
data.
2. Retail and E-commerce:
o Personalizes customer experience using recommendation engines.
o Example: Amazon suggests products by analyzing browsing and purchase history.
3. Banking and Finance:
o Detects fraud by analyzing transaction patterns in real time.
o Example: Banks block suspicious credit card transactions automatically.
4. Manufacturing:
o Improves efficiency by analyzing sensor data from machines (Industrial IoT).
o Example: Predictive maintenance reduces downtime and costs.
5. Telecommunications:
o Reduces churn by analyzing call patterns and customer complaints.
o Example: Telecoms offer discounts to at-risk customers to retain them.
6. Public Sector:
o Enhances governance by analyzing social media sentiment and demographic data.
o Example: Smart city projects use Big Data to manage traffic and utilities.

These applications show that Big Data Analytics is not confined to IT companies but has become a
backbone for innovation and efficiency across all sectors.

Q14. Write a note on the technologies used in Big Data environments such as
HDFS, MapReduce, Spark, etc.
Several technologies form the backbone of Big Data environments:

1. HDFS (Hadoop Distributed File System):
o A distributed storage system that splits data into blocks and replicates them across nodes for fault tolerance.
o Allows large-scale data storage at low cost.
2. MapReduce:
o A programming model for parallel data processing.
o “Map” breaks down tasks into smaller chunks, while “Reduce” aggregates results.
o Example: Counting the frequency of words in millions of documents.
3. Apache Spark:
o In-memory processing framework faster than MapReduce.
o Supports batch, streaming, machine learning (MLlib), and graph analytics (GraphX).
o Widely used for real-time analytics.
4. NoSQL Databases:
o Examples: MongoDB, Cassandra, HBase.
o Handle semi-structured and unstructured data efficiently.
5. YARN (Yet Another Resource Negotiator):
o Resource manager in Hadoop.
o Allocates resources dynamically for various applications.
6. Ecosystem Tools:
o Hive: Data warehouse infrastructure with SQL-like queries.
o Pig: Scripting platform for analyzing large data sets.
o Sqoop: Transfers data between Hadoop and RDBMS.
o Flume: Collects streaming data from sources like logs and social media.

These technologies work together to provide scalable, flexible, and efficient platforms for storing and
analyzing Big Data.
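
As an illustration of how these technologies are driven from code, the sketch below expresses the classic word count against the PySpark RDD API. It assumes PySpark is installed and that input.txt is a small local text file (both assumptions); on a real cluster the input would normally sit in HDFS and the job would be scheduled by YARN.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (on a cluster this would run under YARN instead)
spark = SparkSession.builder.appName("WordCount").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Map: split each line into words and pair each word with a count of 1
# Reduce: sum the counts for each word across all partitions
counts = (sc.textFile("input.txt")              # hypothetical input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```

The flatMap/map/reduceByKey pipeline mirrors the Map and Reduce phases described for Hadoop, but Spark keeps intermediate data in memory, which is why it is generally faster than disk-based MapReduce.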

Q15. List and briefly explain top analytical tools used in Big Data Analytics.
The textbook highlights several widely used tools for Big Data Analytics:

1. Apache Hadoop:
o Open-source framework for distributed storage (HDFS) and batch processing
(MapReduce).
2. Apache Spark:
o In-memory, fast processing framework that supports batch and real-time analytics.
3. Hive:
o Provides SQL-like querying for Hadoop data.
o Suitable for analysts familiar with SQL.
4. Pig:
o High-level scripting platform for analyzing large datasets.
5. HBase:
o A NoSQL database built on HDFS, ideal for real-time read/write access.
6. Tableau/QlikView/Power BI:
o Visualization tools used to represent Big Data insights graphically for business users.
7. R and Python:
o Programming languages with strong support for statistical analysis, machine learning,
and visualization.
8. MongoDB and Cassandra:
o Popular NoSQL databases used for handling semi-structured and unstructured data.

Each tool serves a unique purpose, ranging from storage and batch processing (Hadoop) to real-time
analytics (Spark) and visualization (Tableau), making them integral to a complete Big Data Analytics
ecosystem.

Q16. What is NoSQL? Explain its types and advantages over traditional RDBMS.
NoSQL (Not Only SQL) databases are designed to handle large-scale, diverse, and rapidly changing
data that traditional RDBMS cannot manage effectively. Unlike relational databases, NoSQL systems
are schema-less, horizontally scalable, and optimized for Big Data applications.
Types of NoSQL Databases:

1. Key-Value Stores:
o Store data as key-value pairs.
o Example: Redis, DynamoDB.
o Use Case: Caching, session management.
2. Document Stores:
o Store data in documents (JSON, BSON, XML).
o Example: MongoDB, CouchDB.
o Use Case: Content management, user profiles.
3. Column-Oriented Stores:
o Data is stored in columns rather than rows.
o Example: Cassandra, HBase.
o Use Case: Analytics and real-time big data applications.
4. Graph Databases:
o Store entities as nodes and relationships as edges.
o Example: Neo4j.
o Use Case: Social networks, fraud detection.

Advantages over RDBMS:

• Scalability: Horizontal scaling across commodity servers.
• Flexibility: Schema-less design accommodates changing data models.
• Performance: Optimized for high read/write throughput.
• Variety Support: Handles structured, semi-structured, and unstructured data.

NoSQL databases are therefore better suited for Big Data environments where speed, flexibility, and
scalability are critical.
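
A brief sketch of the document-store style is given below using PyMongo. It assumes a MongoDB server is running locally on the default port, and the database, collection, and field names are made up for illustration.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (assumed to be running on the default port)
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]                      # hypothetical database name

# Schema-less: documents in the same collection can carry different fields
db.customers.insert_one({"name": "Asha", "city": "Bengaluru", "orders": 12})
db.customers.insert_one({"name": "Ravi", "interests": ["cricket", "movies"]})

# Query by field, no fixed table schema required
for doc in db.customers.find({"city": "Bengaluru"}):
    print(doc["name"], doc.get("orders", 0))
```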

Q17. Discuss the features and advantages of Hadoop in Big Data environments.
Features of Hadoop:

1. Distributed Storage (HDFS): Splits data into blocks and replicates across nodes for fault
tolerance.
2. Parallel Processing (MapReduce): Processes data in parallel across clusters.
3. Scalability: Easily scales horizontally by adding commodity hardware.
4. Fault Tolerance: Automatic recovery from node failures using replication.
5. Open Source and Cost Effective: Freely available framework running on inexpensive
hardware.
6. Flexibility: Handles structured, semi-structured, and unstructured data.
7. Ecosystem Integration: Works with Hive, Pig, Spark, HBase, Flume, and other tools.

Advantages:

• Handles petabytes of data efficiently.
• Supports diverse data formats from multiple sources.
• Provides high availability and reliability through replication.
• Reduces cost by eliminating dependency on high-end servers.

Hadoop thus provides the foundation for storing, processing, and analyzing Big Data in a scalable and
fault-tolerant environment.



Q18. Explain the working of HDFS (Hadoop Distributed File System) with a neat
diagram.
HDFS is the storage layer of Hadoop designed to store very large files reliably across multiple
machines.
Key Concepts:

1. NameNode:
o Master node that maintains metadata (file names, locations, block mappings).
o Does not store actual data, only information about data.
2. DataNodes:
o Worker nodes that store actual data blocks.
o Periodically send heartbeat signals to the NameNode.
3. Block Storage:
o Files are divided into fixed-size blocks (default 128 MB).
o Blocks are replicated (default replication factor = 3) across nodes for fault tolerance.

Working:

• When a client uploads a file, it is split into blocks and the NameNode assigns DataNodes to store each block.
• Replicas are created automatically to ensure fault tolerance.
• During retrieval, the NameNode provides block locations, and the client fetches the blocks directly from the DataNodes.

Diagram (from textbook):


Client → NameNode (Metadata) → DataNodes (Block storage + replication)
HDFS thus ensures scalability, fault tolerance, and efficient distributed storage, making it the
backbone of the Hadoop framework.
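
The block and replication arithmetic described above can be checked with a short sketch; the 1 GB file size is a made-up example, while the 128 MB block size and replication factor of 3 are the defaults mentioned earlier.

```python
import math

BLOCK_SIZE_MB = 128          # default HDFS block size
REPLICATION_FACTOR = 3       # default replication factor

file_size_mb = 1024          # hypothetical 1 GB file

# The file is cut into fixed-size blocks...
blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)

# ...and every block is stored on REPLICATION_FACTOR different DataNodes
stored_copies = blocks * REPLICATION_FACTOR
raw_storage_mb = file_size_mb * REPLICATION_FACTOR

print(f"{blocks} blocks, {stored_copies} block replicas, "
      f"{raw_storage_mb} MB of raw cluster storage")
```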

Q19. Compare SQL vs NoSQL databases in the context of Big Data processing.
SQL Databases (RDBMS):

• Schema-based, table-driven structure.
• Suitable for structured data and ACID (Atomicity, Consistency, Isolation, Durability) transactions.
• Examples: MySQL, Oracle, PostgreSQL.

NoSQL Databases:

• Schema-less, flexible, and horizontally scalable.
• Suitable for semi-structured and unstructured data.
• Examples: MongoDB, Cassandra, HBase.

Comparison Table:
Feature | SQL Databases | NoSQL Databases
Data Model | Tables (rows & columns) | Key-Value, Document, Column, Graph
Schema | Fixed, rigid | Dynamic, schema-less
Scalability | Vertical (add CPU/RAM) | Horizontal (add more nodes)
Data Types | Structured | Structured + semi-structured + unstructured
Transactions | Strong ACID support | BASE (Basically Available, Soft state, Eventual consistency)
Use Case | Banking, ERP, CRM | Social media, IoT, Big Data apps

In Big Data contexts, NoSQL is preferred for its ability to handle diverse and evolving data at scale,
while SQL remains strong for transactional systems requiring consistency.

Q20. Explain the role of MapReduce in processing Big Data with an example.
MapReduce is a programming model used in Hadoop for processing large datasets in parallel. It
consists of two major functions:

1. Map Phase:
o Input data is divided into key-value pairs.
o Each “Map” function processes data independently and outputs intermediate key-value
pairs.
2. Reduce Phase:
o Aggregates the intermediate results generated by the Map tasks.
o Produces final output.

Example (Word Count Program):

• Input: A set of documents.
• Map Step: Each word in the documents is mapped with a count of 1 → (“Big”, 1), (“Data”, 1).
• Shuffle and Sort: Intermediate pairs are grouped by key → (“Big”, [1, 1]), (“Data”, [1, 1, 1]).
• Reduce Step: Values are summed → (“Big”, 2), (“Data”, 3).
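
A minimal, single-machine simulation of these three phases is sketched below in plain Python. Real MapReduce jobs run the map and reduce functions in parallel on different nodes and handle the shuffle across the network, but the data flow is the same.

```python
from collections import defaultdict

documents = ["Big Data", "Data Big Data"]        # toy input documents

# Map phase: emit (word, 1) for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: group the intermediate values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the grouped values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'Big': 2, 'Data': 3}
```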

Role in Big Data:

• Enables distributed parallel processing of massive datasets.
• Simplifies complex tasks such as log analysis, recommendation systems, and sentiment analysis.
• Provides fault tolerance by rerunning failed tasks automatically.

MapReduce thus plays a central role in enabling Hadoop to process Big Data efficiently, making
large-scale analytics feasible on commodity clusters.
----------------------------------------------------------*****--------------------------------------------------------------------
