Q7) a) Significance of the Rise of AI-powered Big Data
Analytics in Data-driven Decision-making and Business
Intelligence [6 Marks]
The integration of AI with Big Data Analytics has transformed how
businesses make decisions and derive insights. Its significance includes:
1. Enhanced Predictive Analytics: AI algorithms can analyze
historical data to forecast trends, consumer behavior, and market
dynamics, aiding proactive decision-making.
2. Real-time Decision Making: AI enables rapid processing of
massive data streams in real time, crucial for time-sensitive
sectors like finance, healthcare, and e-commerce.
3. Improved Accuracy and Efficiency: AI minimizes human error
and automates data processing, providing accurate insights faster
than traditional methods.
4. Personalization and Customer Insights: AI analyzes
customer data to deliver personalized recommendations
and targeted marketing, boosting customer engagement
and loyalty.
5. Cost Optimization: By identifying inefficiencies and optimizing
operations, AI-powered analytics help reduce costs and allocate
resources more effectively.
6. Competitive Advantage: Organizations using AI in analytics
gain deeper insights than competitors relying solely on
conventional analytics, enabling better strategic planning.
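As a small illustration of the predictive analytics point, the sketch below fits a straight-line trend to historical values and extrapolates one step ahead. It is only a minimal stand-in for the far richer models used in practice; the function names and the `monthly_sales` figures are hypothetical and chosen for illustration.

```python
from statistics import mean


def fit_linear_trend(values):
    """Fit y = intercept + slope * t by least squares, with t = 0, 1, 2, ..."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    t = range(len(values))
    t_mean = mean(t)
    y_mean = mean(values)
    slope = (
        sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, values))
        / sum((ti - t_mean) ** 2 for ti in t)
    )
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(values, steps_ahead=1):
    """Extrapolate the fitted trend steps_ahead periods past the last observation."""
    intercept, slope = fit_linear_trend(values)
    next_t = len(values) - 1 + steps_ahead
    return intercept + slope * next_t


# Hypothetical historical data (units sold per month).
monthly_sales = [100, 110, 120, 130]
print(forecast(monthly_sales))
```

A real AI-powered pipeline would typically replace the straight line with a trained model, but the workflow is the same: learn from historical data, then predict the next period.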
b) Advantages of Cassandra over Traditional Relational
Databases [6 Marks]
Apache Cassandra is a NoSQL distributed database designed for
handling large volumes of data across many servers. Its advantages
over traditional RDBMS include:
1. High Scalability: Cassandra offers horizontal scalability,
allowing seamless addition of nodes without downtime,
unlike RDBMS, which typically scale vertically and are limited by
the capacity of a single server.
2. Fault Tolerance and High Availability: Data is
automatically replicated across multiple nodes, ensuring no
single point of failure and continuous availability.
3. Decentralized Architecture: All nodes in Cassandra
are equal peers, avoiding bottlenecks typical of master-slave
relational databases.
4. Write and Read Performance: Optimized for high-speed writes
and can handle massive write loads with low latency, which is ideal
for IoT, logs, and sensor data.
5. Flexible Data Model: Supports dynamic schema changes
and complex data types without downtime, whereas RDBMS require
rigid schemas.
6. Big Data Integration: Easily integrates with big data tools
like Hadoop, Spark, and Kafka, making it well-suited for modern
data pipelines and analytics.
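The scalability and replication points above can be sketched with a toy consistent-hashing ring. This is a simplified model, not Cassandra's actual implementation: it assumes one token per node and uses MD5 purely for illustration, and the `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: each node owns one position; keys go to the next nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self._tokens = []  # sorted ring positions
        self._owners = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Join the ring without touching any existing node's position."""
        position = token(node)
        bisect.insort(self._tokens, position)
        self._owners[position] = node

    def replicas(self, key):
        """Walk clockwise from the key's position and collect the replica nodes."""
        count = min(self.replication_factor, len(self._tokens))
        start = bisect.bisect_right(self._tokens, token(key))
        return [
            self._owners[self._tokens[(start + i) % len(self._tokens)]]
            for i in range(count)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
keys = [f"sensor-{i}" for i in range(1000)]
before = {key: ring.replicas(key)[0] for key in keys}

ring.add_node("node-e")  # scale out by one node
after = {key: ring.replicas(key)[0] for key in keys}

moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys changed primary owner")
print("replicas for sensor-1:", ring.replicas("sensor-1"))
```

Only keys that fall into the new node's range change owner, which is why nodes can be added without reshuffling the whole data set, and every key is stored on several nodes so the loss of one node does not make it unavailable.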
c) What is Apache Spark? Explain the main
components of Spark architecture. [5]
Apache Spark is an open-source, distributed computing
system designed for fast processing of large-scale data. It supports
in-memory computing, making it significantly faster than traditional
big data tools like Hadoop MapReduce. Spark is widely used for data
analytics, machine learning, stream processing, and graph
computation.
Main Components of Spark Architecture:
1. Driver Program:
o Acts as the main controller of the Spark application.
o It converts user code into tasks, schedules them,
and manages their execution on the cluster.
2. Cluster Manager:
o Responsible for managing the cluster resources.
o Spark can work with various cluster managers, such as
Standalone and Kubernetes.
o It allocates resources to different applications.
3. Executors:
o Worker processes that run individual tasks and store data
for processing (in-memory or on disk).
o Each Spark application has its own set of executors that
are launched once and run for the entire lifetime of the
application.
4. Tasks:
o The smallest unit of work in Spark.
o Each task is assigned by the driver to an executor and belongs
to one stage of a job.
5. RDDs (Resilient Distributed Datasets):
o The core data structure in Spark.
o Immutable, distributed collections of objects that can be
processed in parallel.
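The division of labour between driver, executors, and tasks can be imitated in plain Python. The sketch below is a toy model, not the Spark API: a thread pool stands in for the executors, and the function names `split_into_partitions`, `task`, and `driver` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Divide the data set into roughly equal slices, one per task."""
    if not data:
        return []
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def task(partition):
    """One task: the same computation applied to a single partition."""
    return sum(x * x for x in partition)


def driver(data, num_partitions=4, num_executors=2):
    """Driver role: plan the tasks, hand them to executors, combine the results."""
    partitions = split_into_partitions(data, num_partitions)
    # The pool plays the part of the executors granted by a cluster manager.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))
    return sum(partial_results)


# Sum of squares of 1..10, computed partition by partition.
print(driver(list(range(1, 11))))
```

In real Spark the partitions live on different machines and the data is held in an RDD, but the shape is the same: the driver plans, the executors run one task per partition, and the partial results are merged.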
Q 8 a) What is Dark Data? Explain the Different Types of Dark
Data [6 Marks]
Definition of Dark Data:
Dark Data refers to the data that organizations collect, process,
and store during regular business activities but do not use for
analysis, decision-making, or business intelligence. This data
remains "in the dark" due to lack of awareness, tools, or
perceived value.
Types of Dark Data:
1. Log Files: System, application, and security logs that are
stored but often ignored unless issues arise.
2. Email and Communication Archives: Emails, chat logs, and call
recordings stored for compliance but not analyzed for insights.
3. Sensor and Machine Data: IoT devices, manufacturing
equipment, and network hardware generate data that is often stored
without further processing.
4. Customer Support Records: Past service tickets, chat
transcripts, and call recordings that could provide insights but are
rarely analyzed.
5. Social Media Data: Unused interactions, likes,
comments, and shares collected via social media monitoring
tools.
6. Document Repositories: Reports, spreadsheets, and PDFs
stored in file systems or SharePoint without metadata tagging
or indexing.
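To make the log-file case concrete, the sketch below shows one simple way stored logs could be turned into a usable summary. The log line layout (date, time, level, message), the `summarize_levels` helper, and the sample lines are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <LEVEL> <message>".
LOG_LINE = re.compile(r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def summarize_levels(lines):
    """Count log entries per severity level, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


# Hypothetical sample of an otherwise unused application log.
sample_log = [
    "2024-01-05 10:00:01 INFO service started",
    "2024-01-05 10:00:07 ERROR payment gateway timeout",
    "2024-01-05 10:00:09 WARNING retrying request",
    "2024-01-05 10:00:12 ERROR payment gateway timeout",
    "malformed line without a level",
]

print(summarize_levels(sample_log))
```

Even a count like this can surface recurring failures that would otherwise stay hidden in archived logs.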
b) Explain the Following Terms:
i) Streaming Analytics [6 Marks]
Streaming Analytics refers to the real-time processing and analysis
of continuous data streams. It involves extracting insights from data
as it is generated, rather than storing it first and analyzing it
later in batches.
Key Features:
Processes data in motion (e.g., sensor data, clickstreams,
financial transactions).
Supports real-time alerting, monitoring, and decision-making.
Common tools: Apache Kafka, Apache Flink, Apache Storm, Spark
Streaming.
Applications:
Fraud detection in banking.
Real-time recommendation engines.
Network monitoring and cybersecurity.
Predictive maintenance in manufacturing.
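The fraud detection application can be sketched as a sliding-window check that handles each event as it arrives. This is a minimal illustration in plain Python rather than code for Kafka, Flink, or Storm; the `detect_bursts` function, its thresholds, and the sample events are hypothetical.

```python
from collections import defaultdict, deque


def detect_bursts(events, window_seconds=60, max_transactions=3):
    """Yield an alert whenever a card exceeds max_transactions within the window.

    events is any iterable of (timestamp_seconds, card_id) pairs ordered by time,
    so it can be a live generator as well as a list.
    """
    recent = defaultdict(deque)  # card_id -> timestamps still inside the window
    for timestamp, card_id in events:
        window = recent[card_id]
        window.append(timestamp)
        # Drop timestamps that have slid out of the window.
        while window and timestamp - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_transactions:
            yield (timestamp, card_id, len(window))


# Hypothetical transaction stream: (seconds since start, card id).
stream = [(0, "A"), (10, "B"), (15, "A"), (20, "A"), (30, "A"), (100, "A"), (200, "B")]

for alert in detect_bursts(stream):
    print("possible fraud:", alert)
```

Because the function is a generator, an alert is produced as soon as the offending event is seen, and only the timestamps inside the current window are kept in memory.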
ii) Real-time Analytics [6 Marks]
Real-time Analytics involves the immediate processing
and analysis of data as it arrives, allowing users to gain insights
and make decisions instantly.
Key Features:
Uses live data or data with minimal latency.
Provides up-to-date dashboards, KPIs, or alerts.
May include both streaming and fast batch processing techniques.
Applications:
Real-time traffic navigation systems.
Stock market monitoring and trading.
Healthcare monitoring (e.g., patient vitals).
Live customer behavior tracking on e-commerce platforms.
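The "fast batch" technique mentioned under Key Features can be sketched as micro-batching: readings are grouped into short fixed windows and a KPI is computed per window. The example below is a simplified illustration using patient heart-rate readings; the `micro_batch_kpis` function, the five-second window, and the sample values are assumptions.

```python
from collections import defaultdict


def micro_batch_kpis(readings, batch_seconds=5):
    """Group (timestamp_seconds, value) readings into fixed windows and average each."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        window_start = (timestamp // batch_seconds) * batch_seconds
        buckets[window_start].append(value)
    return {start: sum(values) / len(values) for start, values in sorted(buckets.items())}


# Hypothetical heart-rate readings: (seconds since start, beats per minute).
heart_rate = [(0, 72), (2, 74), (4, 76), (5, 90), (9, 110), (11, 80)]

for window_start, average in micro_batch_kpis(heart_rate).items():
    print(f"window starting at {window_start}s: average {average:.1f} bpm")
```

In a live dashboard each window would be published as soon as it closes, so the displayed KPI lags the data by at most one window length.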
c) What is Apache Cassandra? Explain Its Key
Features [6 Marks]
What is Apache Cassandra?
Apache Cassandra is an open-source, distributed NoSQL
database designed to handle large volumes of data across
multiple servers with high availability, fault tolerance, and no
single point of failure. It is best suited for applications that require
scalability, performance, and 24/7 uptime.
Originally developed by Facebook and later open-sourced, Cassandra is
now part of the Apache Software Foundation.
Key Features of Apache Cassandra:
1. High Availability and Fault Tolerance:
o Cassandra ensures data is always accessible by replicating
it across multiple nodes and data centers. If one node fails,
others continue serving data.
2. Scalability (Horizontal Scaling):
o Easily scales out by adding more nodes without downtime.
Throughput increases roughly linearly as nodes are added.
3. Decentralized / Peer-to-Peer Architecture:
o All nodes in the cluster are equal; there is no master
node. This eliminates bottlenecks and single points of
failure.
4. High Write Performance:
o Optimized for high-speed writes, making it suitable for
write-intensive applications like IoT, messaging apps, and real-time
analytics.
5. Flexible Schema (Schema-less):
o Allows dynamic changes to the data model without
affecting existing applications. Ideal for evolving
applications and variable data formats.
6. Tunable Consistency:
o Developers can choose between strong consistency
and eventual consistency for each read or write, based on
application needs.
7. Support for Distributed Replication:
o Supports replication across multiple geographic locations,
improving data locality and disaster recovery.
8. CQL (Cassandra Query Language):
o Uses a SQL-like language (CQL) for querying, making it
easier for developers with RDBMS backgrounds to
adapt.
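Tunable consistency can be illustrated with the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas written plus the number read exceeds the replication factor. The sketch below applies that rule to the consistency levels ONE, QUORUM, and ALL; the helper names are hypothetical and no Cassandra driver is used.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must respond for a given consistency level."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def is_strongly_consistent(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when every read set must overlap every write set (W + R > RF)."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return written + read > replication_factor


replication_factor = 3
for write_level, read_level in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ALL", "ONE")]:
    strong = is_strongly_consistent(write_level, read_level, replication_factor)
    label = "strong" if strong else "eventual"
    print(f"write={write_level:<6} read={read_level:<6} -> {label} consistency")
```

Lower levels such as ONE give faster responses and higher availability, while QUORUM or ALL trade some latency for stronger guarantees, which is the choice the feature leaves to the developer.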
Use Cases:
Real-time analytics
Internet of Things (IoT)
Social media platforms
Messaging apps
Recommendation systems