
THEORY FILE: Data Warehouse and Mining

(FULL NOTES: BY SAHIL RAUNIYAR)

SUBJECT CODE: UGCA-1931

BACHELOR OF COMPUTER APPLICATIONS

MAINTAINED BY (TEACHER/MA'AM): Prof. Sahil Kumar

COLLEGE ROLL NO: 226617

UNIVERSITY ROLL NO: 2200315

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

BABA BANDA SINGH BAHADUR ENGINEERING COLLEGE, FATEHGARH SAHIB



Program ➖ BCA

Course Name ➖ Data Warehouse and Mining (Theory)

Semester ➖ 5th

UNIT ➖01

● Need for strategic information, difference between operational and Informational
data stores

Strategic Information ➖
Strategic information is vital for organisations as it supports long-term planning and
decision-making processes. It helps executives and senior managers to understand the
competitive landscape, market trends, and internal performance metrics, thereby guiding the
organisation towards achieving its strategic goals.

Key Needs for Strategic Information:

I. Informed Decision-Making: Helps leaders make data-driven decisions.


II. Competitive Advantage: Identifies market opportunities and threats.
III. Performance Monitoring: Tracks progress towards strategic objectives.
IV. Resource Allocation: Assists in the optimal allocation of resources.
V. Trend Analysis: Identifies long-term trends and patterns.


Difference Between Operational and Informational Data Stores in Data Warehousing and
Data Mining
Data Warehousing involves the storage, retrieval, and management of large volumes of data
from different sources to support business analysis and reporting. Data Mining involves
analysing large datasets to find patterns, correlations, and insights for decision-making.

Operational Data Stores (ODS) ➖


Operational Data Stores are used for the day-to-day operations of a business. They support the
transactional processes of the organisation and are characterised by frequent updates and
real-time data access.

Characteristics: ➖
I. Real-Time Data: Provides up-to-date information.
II. Frequent Updates: Data is continuously updated as transactions occur.
III. Transaction-Oriented: Focuses on current operations and transactions.
IV. Normalisation: Data is highly normalised to eliminate redundancy and ensure data
integrity.

Uses: ➖
I. Order Processing: Managing customer orders.
II. Inventory Management: Tracking stock levels.
III. Customer Relationship Management (CRM): Managing customer interactions.

Informational Data Stores (Data Warehouses) ➖


Informational Data Stores are used for business analysis and decision support. They aggregate
data from various sources, providing a comprehensive view of the organisation's data over time.

Characteristics:
I. Historical Data: Contains historical data for trend analysis.
II. Periodic Updates: Data is updated periodically (e.g., daily, weekly).
III. Analysis-Oriented: Supports complex queries and data analysis.
IV. De-normalization: Data is often de-normalized to optimise query performance.

Uses:
I. Business Intelligence: Generating reports and dashboards.
II. Data Mining: Discovering patterns and insights.
III. Strategic Planning: Supporting long-term decision-making.

Conclusion
Understanding the differences between operational and informational data stores is crucial for
effectively leveraging data warehousing and mining techniques. While operational data stores
support the daily transactional processes, informational data stores provide the historical and
comprehensive view necessary for strategic decision-making and analysis.

● Data warehouse definition, characteristics, Data warehouse role and structure, OLAP Operations, Data mart, Difference between data mart and data warehouse, Approaches to build a data warehouse, Building a data warehouse, Metadata & its types

Data Warehouse Definition ➖


A data warehouse is a centralised repository designed to store integrated data from multiple
sources, optimised for query and analysis. It provides a consolidated view of an organisation's
data, enabling advanced reporting and business intelligence.

Characteristics of a Data Warehouse ➖


I. Subject-Oriented: Organised around key subjects or business processes.
II. Integrated: Combines data from various sources into a cohesive format.
III. Time-Variant: Historical data is stored and can be analysed over time.
IV. Non-Volatile: Once data is loaded, it is not altered or deleted.



Role and Structure of a Data Warehouse
Role:
I. Decision Support: Facilitates informed decision-making by providing comprehensive data
analysis.
II. Business Intelligence: Supports reporting, online analytical processing (OLAP), and data
mining.
III. Data Consolidation: Integrates disparate data sources into a unified view.

Structure: ➖
I. Source Systems: Operational databases, external data sources.
II. Staging Area: Intermediate storage for ETL (Extract, Transform, Load) processes.
III. Data Storage: Central repository (data warehouse) organised in schemas like star,
snowflake.
IV. Data Presentation: Access layer for reporting and analysis tools.

OLAP Operations ➖
I. Roll-Up: Aggregating data along a dimension.

II. Drill-Down: Breaking data into finer levels of detail.
III. Slice: Selecting a single dimension to create a subset.
IV. Dice: Creating a subcube by selecting multiple dimensions.
V. Pivot: Rotating data axes to view from different perspectives.

Data Mart ➖
A data mart is a subset of a data warehouse, focused on a specific business area or
department. It is designed for a particular group of users, offering tailored data and faster query
performance.
Differences Between Data Mart and Data Warehouse

Scope:
Data Mart: Specific to a business area.
Data Warehouse: Enterprise-wide.

Size:
Data Mart: Smaller, more focused datasets.
Data Warehouse: Larger, comprehensive datasets.

Users:
Data Mart: Specific departmental users.
Data Warehouse: Broad range of users across the organization.

Implementation Time:
Data Mart: Quicker to implement.
Data Warehouse: Takes longer to build due to broader scope.

Approaches to Build a Data Warehouse

Top-Down Approach: ➖
I. Starts with designing the overall data warehouse.
II. Later builds data marts from the centralised data warehouse.

Bottom-Up Approach: ➖
I. Begins with building data marts for specific business areas.
II. Integrates these data marts into an enterprise-wide data warehouse over time.

Hybrid Approach: ➖
I. Combines elements of both top-down and bottom-up approaches.
Building a Data Warehouse ➖

Requirements Analysis: ➖

I. Understand business needs and data requirements.
Data Modelling:
I. Design conceptual, logical, and physical data models.

ETL Process:
Extract data from source systems, transform it into a suitable format, and load it into the data
warehouse.

Data Warehouse Architecture:


I. Define hardware and software architecture.

Implementation:
I. Set up the data warehouse environment and load data.

Testing and Validation:


I. Ensure data accuracy, performance, and reliability.

Deployment and Maintenance:


I. Go live and maintain the data warehouse with regular updates and optimizations.

Metadata & Its Types ➖


Metadata is data about data. It describes the structure, operations, and contents of the data
warehouse.

Business Metadata: ➖
I. Describes the data from a business perspective (e.g., definitions, business rules).

Technical Metadata:
I. Describes technical aspects (e.g., data types, ETL processes, table structures).

Operational Metadata:
I. Contains information about system operations (e.g., data lineage, ETL job logs, error
reports).

Process Metadata:
Details about the processes that manipulate data (e.g., ETL transformations, scheduling
information).
● Data warehouse definition, characteristics, Data warehouse role and structure, OLAP Operations, Data mart, Difference between data mart and data warehouse, Approaches to build a data warehouse, Building a data warehouse, Metadata & its types

Data Warehouse Definition ➖


A data warehouse is a centralised repository that stores large volumes of data from multiple
sources. This data is consolidated, transformed, and made available for analysis and reporting.
It is designed to support decision-making processes by providing historical data and enabling
complex queries and analysis.

Characteristics of a Data Warehouse ➖


I. Subject-Oriented: Organised around key subjects, such as customers, sales, or products, rather than specific business processes.

II. Integrated: Consolidates data from various sources into a consistent format, with uniform
naming conventions, measurements, and encoding structures.

III. Non-Volatile: Once data is entered into the data warehouse, it is not updated or deleted.
This ensures a stable historical record.

IV. Time-Variant: Contains historical data to track changes over time, often with time stamps.

Role of a Data Warehouse ➖


A data warehouse serves as a central repository for integrated data, supporting business
intelligence (BI) activities like reporting, querying, and analysis. It helps organisations make
informed decisions by providing a comprehensive view of historical and current data.


Structure of a Data Warehouse
I. Data Sources: Various operational systems, external data sources, and other data
repositories.

II. ETL Processes: Extract, Transform, Load processes that gather data from sources, clean
and transform it, and load it into the warehouse.

III. Data Storage: Centralised storage area where data is organised into schemas like star or
snowflake schemas.

IV. Metadata: Data about the data, including definitions, mappings, and data lineage.

V. Access Tools: Query and reporting tools, OLAP (Online Analytical Processing) tools, and
data mining tools.

OLAP Operations ➖
OLAP tools allow users to interactively analyse multidimensional data from multiple
perspectives.

Common OLAP operations include:
I. Slice: Isolating a single layer of a data cube.

II. Dice: Creating a sub-cube by selecting specific values for multiple dimensions.

III. Drill Down/Up: Navigating through the levels of data, from summary to detailed data and
vice versa.

IV. Pivot (Rotate): Reorienting the multidimensional view of data to look at it from different
perspectives.
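To make the operations listed above concrete, here is a minimal sketch using pandas (the library and the column names region, product, year, month and sales are my own choices for illustration; the notes do not prescribe any tool):

import pandas as pd

# Toy sales data: dimensions = region, product, year, month; measure = sales
df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["Pen", "Book", "Pen", "Book", "Pen", "Pen"],
    "year":    [2023, 2023, 2023, 2023, 2024, 2024],
    "month":   [1, 1, 2, 2, 1, 1],
    "sales":   [100, 200, 150, 250, 120, 180],
})

# Roll-up: aggregate month-level rows up to yearly totals per region
roll_up = df.groupby(["region", "year"])["sales"].sum()

# Drill-down: break the yearly totals back down to month level
drill_down = df.groupby(["region", "year", "month"])["sales"].sum()

# Slice: fix a single dimension (year = 2023) to isolate one layer of the cube
slice_2023 = df[df["year"] == 2023]

# Dice: select specific values on several dimensions to form a sub-cube
dice = df[(df["region"] == "North") & (df["year"] == 2023)]

# Pivot (rotate): regions as rows, products as columns
pivot = df.pivot_table(index="region", columns="product", values="sales", aggfunc="sum")

print(roll_up, drill_down, slice_2023, dice, pivot, sep="\n\n")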

Data Mart ➖
A data mart is a subset of a data warehouse focused on a specific business area, department,
or subject. It provides a more limited, subject-specific view of data and is often easier and faster
to implement than a full data warehouse.

Differences Between Data Mart and Data Warehouse ➖


I. Scope: Data warehouses are enterprise-wide, while data marts are department-specific.

II. Data Integration: Data warehouses integrate data from multiple sources, whereas data
marts may focus on a single source or a few sources.

III. Size: Data warehouses are typically much larger in size than data marts.

IV. Implementation Time: Data marts can be implemented more quickly compared to data warehouses.


Approaches to Build a Data Warehouse
I. Top-Down Approach: Start with a comprehensive data warehouse design and
implementation, followed by the creation of data marts.

II. Bottom-Up Approach: Start with the implementation of data marts, which are then
integrated into a comprehensive data warehouse.

III. Hybrid Approach: Combines elements of both top-down and bottom-up approaches,
implementing core components of a data warehouse along with subject-specific data
marts.

Building a Data Warehouse ➖


I. Requirements Gathering: Understand the business requirements and objectives.

II. Data Modeling: Design the data warehouse schema, choosing between star, snowflake,
or galaxy schemas.

III. ETL Development: Create ETL processes to extract data from source systems, transform
it, and load it into the warehouse.
IV. Data Loading: Load the historical data into the warehouse.

V. Metadata Management: Develop and maintain metadata to ensure data consistency and
transparency.

VI. Deployment: Implement the data warehouse and make it accessible to end-users.

VII. Maintenance and Evolution: Regularly update and maintain the warehouse to
accommodate changing business needs.

Metadata and Its Types ➖


Metadata is data that describes other data, providing context and information about the
structure, operations, and usage of the data.

Types of metadata include: ➖
I. Technical Metadata: Details about data sources, data structures, data transformations,
and storage (e.g., schema definitions, data lineage).

II. Business Metadata: Describes the meaning and context of data from a business
perspective (e.g., business terms, data ownership).
III. Operational Metadata: Information about the operations performed on the data, including ETL
processes and data quality metrics (e.g., job logs, error logs).

UNIT ➖02
● Data Pre-processing: Need, Data Summarization, Methods. Denormalization,
Multidimensional data model, Schemas for multidimensional data (Star schema,
Snowflake Schema, Fact Constellation Schema, Difference between different
schemas).

Data Pre-processing ➖
● Need for Data Pre-processing

Data pre-processing is a crucial step in the data mining process because real-world data is
often incomplete, noisy, and inconsistent. Pre-processing transforms raw data into an
understandable format.

Key reasons for pre-processing include: ➖


I. Improving Data Quality: Ensures data is accurate, complete, and consistent.

II. Enhancing Analysis: Makes data suitable for analysis, ensuring accurate and reliable results.

III. Data Integration: Combines data from multiple sources to provide a unified view.

IV. Reducing Complexity: Simplifies data to reduce computational requirements and improve
performance.


● Data Summarization
Data summarization involves creating compact representations of data sets.

Key methods include: ➖


I. Descriptive Statistics: Measures such as mean, median, mode, standard deviation, and
range to summarise data.

II. Data Cube Aggregation: Aggregating data across multiple dimensions to create summary
data cubes.

III. Data Visualization: Graphical representations like histograms, pie charts, and scatter plots
to summarise data visually.

IV. Data Reduction: Techniques like dimensionality reduction and data compression to create
smaller representations of the data.

● Methods of Data Pre-processing
I. Data Cleaning: Handling missing values, removing noise, correcting inconsistencies.
A. Techniques: Imputation, smoothing, interpolation, and outlier detection.

II. Data Integration: Combining data from multiple sources to create a coherent data set.
B. Techniques: Schema integration, entity identification, and redundancy elimination.

III. Data Transformation: Converting data into suitable formats or structures.


C. Techniques: Normalisation, scaling, discretization, and attribute construction.

IV. Data Reduction: Reducing the volume of data while maintaining its integrity.
D. Techniques: Principal Component Analysis (PCA), sampling, clustering, and aggregation.
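A minimal sketch of a few of the techniques listed above, using pandas (an assumed tool choice; the small age/income table is invented for illustration):

import pandas as pd

# Toy data with a missing value and attributes on very different scales
df = pd.DataFrame({"age": [25, 32, None, 51], "income": [30000, 48000, 52000, 90000]})

# Data cleaning: impute the missing age with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Data transformation: min-max normalisation of income into the [0, 1] range
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Data transformation: discretisation of age into three equal-width bins
df["age_bin"] = pd.cut(df["age"], bins=3, labels=["young", "middle", "senior"])

print(df)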

● Denormalization
Denormalization is the process of combining normalised tables into fewer tables to improve
query performance. It introduces redundancy to reduce the number of joins required in queries,
which can significantly speed up data retrieval. Denormalization is commonly used in OLAP
systems to enhance performance.

● Multidimensional Data Model ➖


A multidimensional data model is used in OLAP systems to represent data in multiple dimensions. It allows for complex queries and analysis across different dimensions (e.g., time, geography, product).

The key components include: ➖


I. Dimensions: Attributes or perspectives through which data can be analysed (e.g., time,
location, product).
II. Measures: Quantitative data points (e.g., sales, profit) that are analysed along
dimensions.

III. Data Cubes: Multidimensional arrays of values, used to represent data along different
dimensions.

● Schemas for Multidimensional Data ➖


Star Schema
I. Structure: Central fact table connected to multiple dimension tables.
II. Characteristics: Simple, easy to understand, and fast query performance.
III. Fact Table: Contains measures and foreign keys to dimension tables.
IV. Dimension Tables: Contain descriptive attributes related to dimensions.
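As a small illustration of this structure, the sketch below builds one fact table and two dimension tables as plain pandas DataFrames and runs a typical star-schema query (join the fact to its dimensions, then aggregate). The table and column names are invented; a real warehouse would normally do this in SQL, but the shape is the same:

import pandas as pd

# Dimension tables: descriptive attributes, one key each
dim_product = pd.DataFrame({"product_id": [1, 2],
                            "name": ["Pen", "Book"],
                            "category": ["Stationery", "Books"]})
dim_store = pd.DataFrame({"store_id": [10, 20], "city": ["Delhi", "Mumbai"]})

# Fact table: foreign keys to the dimensions plus numeric measures
fact_sales = pd.DataFrame({"product_id": [1, 1, 2, 2],
                           "store_id":   [10, 20, 10, 20],
                           "revenue":    [50, 30, 80, 280]})

# Typical query: join fact to dimensions, then aggregate revenue by category and city
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_store, on="store_id")
          .groupby(["category", "city"])["revenue"].sum())
print(report)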

● Snowflake Schema ➖
I. Structure: Similar to star schema but with normalised dimension tables.
II. Characteristics: Reduces redundancy, more complex queries, slower performance
compared to star schema.
III. Fact Table: Contains measures and foreign keys to dimension tables.
IV. Dimension Tables: Can be normalised into multiple related tables.

● Fact Constellation Schema (Galaxy Schema) ➖


I. Structure: Multiple fact tables sharing dimension tables.

II. Characteristics: Suitable for complex applications, can represent multiple business
processes.

III. Fact Tables: Multiple fact tables related to different processes or subjects.

IV. Dimension Tables: Shared among fact tables, reducing redundancy.

● Differences Between Schemas
I. Complexity: Star schema is the simplest, followed by snowflake schema, and then fact
constellation schema.

II. Performance: Star schema usually offers the best performance due to fewer joins, while
snowflake schema can be slower due to normalisation. Fact constellations can be
complex to query.

III. Redundancy: Snowflake schema reduces redundancy compared to star schema by
normalising dimension tables. Fact constellation shares dimensions across fact tables to
reduce redundancy.

IV. Use Cases: Star schema is often used for simple queries and reporting, snowflake
schema for data warehousing with normalised data, and fact constellation for complex analytical applications involving multiple business processes.


● Data warehouse architecture, OLAP servers, Indexing OLAP Data, OLAP query
processing, Data cube computation

● Data Warehouse Architecture ➖


Data warehouse architecture consists of several layers that facilitate the storage, processing,
and retrieval of data for analytical purposes. The common layers include:

Data Source Layer: ➖


I. Operational Databases: Source systems like ERP, CRM, legacy systems.
II. External Data Sources: Market data, social media, third-party data.

Data Staging Layer: ➖


I. ETL Processes: Extract, Transform, Load processes for cleaning, transforming, and
loading data into the warehouse.

Data Storage Layer: ➖


I. Data Warehouse Repository: Centralised storage where processed data is stored, often
using relational databases or data lakes.

II. Metadata Repository: Stores metadata about the data, including schema definitions, data
lineage, and transformation rules.

Data Presentation Layer: ➖


I. Data Marts: Subject-specific subsets of the data warehouse tailored for departmental use.

II. OLAP Cubes: Pre-aggregated, multidimensional views of data for fast querying.

Data Access Layer:
I. Query and Reporting Tools: BI tools for generating reports and dashboards.

II. OLAP Tools: Tools for multidimensional analysis.

III. Data Mining Tools: Tools for discovering patterns and insights.

● OLAP Servers ➖
OLAP (Online Analytical Processing) servers support complex analytical queries and
multidimensional data analysis.

Types of OLAP servers include: ➖


1. MOLAP (Multidimensional OLAP): ➖
I. Storage: Data is stored in multidimensional cubes.
II. Performance: High query performance due to pre-aggregation.
III. Advantages: Fast query response, optimised storage for multidimensional data.
IV. Disadvantages: Data redundancy, limited scalability for very large data sets.

2. ROLAP (Relational OLAP): ➖


I. Storage: Data is stored in relational databases, and multidimensional views are created
on-the-fly.
II. Performance: Slower than MOLAP due to on-the-fly computation.
III. Advantages: Scalability, can handle large data sets, no data redundancy.
IV. Disadvantages: Slower query performance compared to MOLAP.

3. HOLAP (Hybrid OLAP): ➖


I. Storage: Combines features of MOLAP and ROLAP, storing some data in
multidimensional cubes and some in relational databases.
II. Performance: Balances between query performance and scalability.
III. Advantages: Flexibility, balanced performance, and scalability.
IV. Disadvantages: Complexity in implementation and maintenance.

● Indexing OLAP Data ➖


Indexing in OLAP systems enhances query performance by allowing faster data retrieval.

Common indexing techniques include: ➖


1. Bitmap Indexes: ➖
I. Usage: Suitable for columns with low cardinality.
II. Structure: Uses bitmaps (binary vectors) to represent the presence or absence of values.
III. Advantages: Fast query performance for AND/OR operations.

IV. Disadvantages: Less efficient for high cardinality columns.
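A minimal sketch of the bitmap-index idea, using plain Python integers as bit vectors (this is a simplification of what a real OLAP engine does; the column values are invented):

# "region" column for 6 rows; low cardinality, so one bitmap per distinct value
regions = ["North", "South", "North", "East", "South", "North"]
region_bitmaps = {}
for i, value in enumerate(regions):
    # set bit i in the bitmap of this value
    region_bitmaps[value] = region_bitmaps.get(value, 0) | (1 << i)

# "year" column for the same rows
years = [2023, 2023, 2024, 2023, 2024, 2023]
year_bitmaps = {}
for i, value in enumerate(years):
    year_bitmaps[value] = year_bitmaps.get(value, 0) | (1 << i)

# The query "region = North AND year = 2023" becomes one bitwise AND
hits = region_bitmaps["North"] & year_bitmaps[2023]
matching_rows = [i for i in range(len(regions)) if hits & (1 << i)]
print(matching_rows)   # -> [0, 5]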

2. B-Tree Indexes: ➖
I. Usage: Suitable for high-cardinality columns.
II. Structure: Balanced tree structure for indexing data.
III. Advantages: Efficient range queries and sorting.
IV. Disadvantages: Slower performance for AND/OR operations compared to bitmap
indexes.

3. Join Indexes:

I. Usage: Pre-computes join operations between fact and dimension tables.
II. Structure: Stores mappings between the rows of joined tables.
III. Advantages: Speeds up join operations, reducing query execution time.
IV. Disadvantages: Increased storage requirements.

● OLAP Query Processing ➖


OLAP query processing involves executing complex analytical queries on multidimensional
data.

Key components include: ➖


I. Query Parsing: Analysing and validating the query syntax.

II. Query Optimization: Determining the most efficient execution plan.

III. Query Execution: Performing the query operations, such as aggregations, filtering, and
sorting.

IV. Result Presentation: Displaying the query results in a user-friendly format, such as reports
or dashboards.

● Data Cube Computation ➖


Data cube computation involves creating multidimensional arrays of data to support fast
querying and analysis.

● Key steps include: ➖


I. Aggregation: Calculating aggregated values for different combinations of dimensions
(e.g., total sales by region and time).

II. Roll-Up: Aggregating data along a dimension hierarchy (e.g., daily to monthly sales).

III. Drill-Down: Breaking down aggregated data into finer granularity (e.g., monthly to daily
sales).

IV. Slicing: Extracting a single layer of the data cube (e.g., sales for a specific region).

V. Dicing: Creating a sub-cube by selecting specific values for multiple dimensions (e.g.,
sales for a specific region and time period).

● Efficient data cube computation techniques include: ➖



I. Materialisation: Pre-computing and storing aggregated values for fast retrieval. Can be full (all possible aggregates) or partial (only frequently used aggregates).

II. Online Aggregation: Computing aggregates on-the-fly during query execution.

III. Multidimensional Indexing: Using indexes to speed up access to specific slices or dices of the data cube.
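As a sketch of full materialisation, the code below pre-computes the aggregate for every subset of dimensions (every group-by of the cube), using pandas and invented column names; a partial materialisation would simply keep only the frequently used subsets:

from itertools import combinations
import pandas as pd

df = pd.DataFrame({"region": ["North", "North", "South", "South"],
                   "year":   [2023, 2024, 2023, 2024],
                   "sales":  [100, 120, 150, 180]})

dimensions = ["region", "year"]
cube = {}

# One aggregate table per subset of dimensions, including the empty subset (grand total)
for k in range(len(dimensions) + 1):
    for dims in combinations(dimensions, k):
        if dims:
            cube[dims] = df.groupby(list(dims))["sales"].sum()
        else:
            cube[dims] = df["sales"].sum()   # apex of the cube

for dims, agg in cube.items():
    print(dims, agg, sep="\n")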

UNIT ➖ 03

● Data Mining: Definition, Data Mining process, Data mining methodology, Data
mining tasks, Mining various Data types & issues.

Data Mining ➖
● Definition ➖
Data mining is the process of discovering patterns, correlations, trends, and anomalies within
large sets of data through the use of various techniques such as machine learning, statistics,
and database systems. It aims to transform raw data into meaningful information for
decision-making purposes.

● Data Mining Process ➖


1. Business Understanding: ➖
I. Define the project objectives and requirements from a business perspective.

II. Formulate data mining goals based on the business objectives.

2. Data Understanding: ➖
I. Collect initial data and familiarise yourself with it.

II. Identify data quality issues and discover initial insights.


3. Data Preparation: ➖
I. Cleanse the data by handling missing values, outliers, and inconsistencies.
II. Transform data into suitable formats for mining (e.g., normalisation, discretization).
III. Integrate data from multiple sources.

4. Modelling: ➖
I. Select appropriate data mining techniques (e.g., classification, clustering, association).
II. Build and calibrate models based on the prepared data.
III. Evaluate models for accuracy and performance.

5. Evaluation: ➖
I. Assess the models to ensure they meet business objectives.
II. Compare results with expectations and refine the models as needed.

6. Deployment: ➖
I. Implement the data mining results into the decision-making process.
II. Monitor and maintain the models over time to ensure they remain effective.

● Data Mining Methodology

1. CRISP-DM (Cross-Industry Standard Process for Data Mining): ➖


I. A widely-used methodology comprising six phases: business understanding, data
understanding, data preparation, modelling, evaluation, and deployment.

2. SEMMA (Sample, Explore, Modify, Model, Assess): ➖


I. A methodology developed by SAS, focusing on the exploratory and modelling aspects of
data mining.

● Data Mining Tasks ➖


1. Classification:

I. Assigning items to predefined categories or classes (e.g., spam detection in emails).

2. Regression:
I. Predicting a continuous numeric value based on input features (e.g., predicting house
prices).

3. Clustering:
I. Grouping similar items together based on their features (e.g., customer segmentation).
4. Association Rule Mining: ➖
I. Discovering interesting relationships between variables in large databases (e.g., market
basket analysis).

5. Anomaly Detection: ➖
I. Identifying unusual or rare items that do not conform to expected patterns (e.g., fraud
detection).

6. Sequential Pattern Mining: ➖


I. Finding regular sequences or patterns in time-ordered data (e.g., purchase sequence
analysis).

● Mining Various Data Types & Issues ➖


● Types of Data: ➖
I. Relational Data: Structured data stored in relational databases.
II. Transactional Data: Data from transactional systems like sales transactions.
III. Temporal Data: Time-series data capturing changes over time.
IV. Spatial Data: Data related to geographical or spatial information.
V. Text Data: Unstructured data in text format.
VI. Multimedia Data: Data in formats like images, audio, and video.
VII. Web Data: Data from web sources, including web logs and social media.

● Issues in Data Mining:
I. Data Quality: Handling missing, noisy, and inconsistent data.
II. Scalability: Managing large-scale data mining to ensure efficiency.
III. Data Privacy and Security: Ensuring the protection of sensitive data.
IV. Interpretability: Making the results of data mining understandable and actionable for
decision-makers.
V. Integration: Combining data from various sources and ensuring consistency.
VI. Dynamic Data: Adapting to changes in data and models over time.

● Detailed Breakdown of Data Mining Tasks ➖


1. Classification ➖
I. Techniques: Decision trees, neural networks, support vector machines, k-nearest
neighbours.
II. Applications: Spam detection, medical diagnosis, credit scoring.

2. Regression ➖
I. Techniques: Linear regression, polynomial regression, logistic regression.
II. Applications: Predicting prices, demand forecasting, risk assessment.

3. Clustering ➖
I. Techniques: K-means, hierarchical clustering, DBSCAN.
II. Applications: Market segmentation, image segmentation, anomaly detection.

4. Association Rule Mining ➖


I. Techniques: Apriori algorithm, Eclat algorithm, FP-growth algorithm.
II. Applications: Market basket analysis, cross-selling strategies, web usage mining.


5. Anomaly Detection ➖
I. Techniques: Isolation forests, one-class SVM, statistical methods.
II. Applications: Fraud detection, network security, fault detection.

6. Sequential Pattern Mining ➖


I. Techniques: PrefixSpan, SPADE, GSP.
II. Applications: Customer purchase patterns, web clickstream analysis, gene sequence analysis.


● Attribute-Oriented Induction, Association rule mining, Frequent itemset mining,
The Apriori Algorithm, Mining multilevel association rules

● Attribute-Oriented Induction ➖
Attribute-oriented induction is a data mining technique that generalises data by abstracting
low-level data into higher-level concepts. This method is particularly useful for simplifying large
datasets and uncovering meaningful patterns.

Process: ➖
I. Data Collection: Gather relevant data from various sources.
II. Attribute Generalisation: Replace specific values of attributes with more general concepts
based on a concept hierarchy.
III. Generalization Control: Determine the level of abstraction by specifying thresholds or
stopping criteria.
IV. Attribute Reduction: Eliminate irrelevant or less significant attributes to focus on key
aspects of the data.
V. Rule Generation: Derive general rules and patterns from the generalized data.

Applications: ➖
I. Summarising large datasets
II. Knowledge discovery in databases
III. Data reduction and preprocessing for other data mining tasks

● Association Rule Mining ➖


Association rule mining identifies interesting relationships or patterns among a set of items in
transactional databases.

These rules help in understanding how items co-occur.

Concepts: ➖
I. Support: The frequency of an itemset in the dataset. Support(X) = (Number of
transactions containing X) / (Total number of transactions)
II. Confidence: The likelihood that the presence of one item leads to the presence of
another. Confidence(X -> Y) = Support(X ∪ Y) / Support(X)
III. Lift: Measures the strength of an association rule compared to the random occurrence of
the items. Lift(X -> Y) = Confidence(X -> Y) / Support(Y)
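A tiny worked computation of these three measures for the rule {bread} -> {butter}, over five invented transactions (a sketch assuming a simple list-of-sets representation):

# Five transactions
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / n

sup_xy = support({"bread", "butter"})       # Support(X ∪ Y) = 3/5 = 0.6
confidence = sup_xy / support({"bread"})    # Confidence(X -> Y) = 0.6 / 0.8 = 0.75
lift = confidence / support({"butter"})     # Lift(X -> Y) = 0.75 / 0.8 = 0.9375
print(sup_xy, confidence, lift)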

● Frequent Itemset Mining ➖


Frequent itemset mining is the process of finding itemsets that appear frequently in a dataset.
These item sets are the basis for generating association rules.

Key Steps: ➖
I. Identify frequent itemsets: Find all itemsets with support above a specified threshold.
II. Generate strong association rules: Use the frequent itemsets to generate rules that meet
minimum support and confidence thresholds.

● The Apriori Algorithm ➖


The Apriori algorithm is a classic algorithm for frequent itemset mining and association rule
learning over transactional databases.

Steps: ➖
I. Generate Candidate Itemsets: Start with single items and iteratively generate larger
itemsets.
II. Prune Infrequent Itemsets: Remove candidate itemsets that do not meet the minimum
support threshold.
III. Repeat: Continue generating and pruning candidate itemsets until no more frequent
itemsets can be found.
IV. Generate Rules: Use the frequent itemsets to generate association rules.

Pseudocode:

Ck: Candidate itemset of size k
Lk: Frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) {
    Ck+1 = candidates generated from Lk;
    for each transaction t in database {
        increment the count of all candidates in Ck+1 that are contained in t
    }
    Lk+1 = candidates in Ck+1 with min_support
}
return ∪k Lk;
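The same algorithm as a runnable Python sketch (my own minimal, unoptimised rendering of the pseudocode above, using frozensets and an absolute support count):

def apriori(transactions, min_support):
    # Return every itemset whose support count is at least min_support
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= min_support}
    frequent = set(Lk)
    k = 1
    while Lk:
        # Ck+1: join pairs of frequent k-itemsets into candidate (k+1)-itemsets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Count each candidate over the database and keep those with min_support
        Lk = {c for c in candidates
              if sum(1 for t in transactions if c <= t) >= min_support}
        frequent |= Lk
        k += 1
    return frequent

# Example: market-basket style transactions, minimum support = 2 transactions
baskets = [{"bread", "butter"}, {"bread", "milk"},
           {"bread", "butter", "milk"}, {"butter", "milk"}]
print(apriori(baskets, min_support=2))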

● Mining Multilevel Association Rules ➖


Multilevel association rules are rules that involve items at different levels of abstraction. These
rules provide more comprehensive insights by considering a hierarchy of concepts.

Process: ➖
I. Define Concept Hierarchies: Establish hierarchies for items (e.g., "electronics" ->
"computers" -> "laptops").
II. Mine Frequent Itemsets at Each Level: Start from the lowest level and mine frequent
itemsets.
III. Generate Rules: Generate association rules at each level of abstraction.
IV. Prune and Refine: Apply thresholds for support and confidence at different levels to refine
the rules.

Example:
I. Low-Level Rule: {HP Laptop, Wireless Mouse} -> {Laptop Bag}
II. High-Level Rule: {Laptop, Mouse} -> {Bag}

Summary ➖
I. Attribute-Oriented Induction: Generalizes data to higher-level concepts, simplifying and
summarising large datasets.

II. Association Rule Mining: Discovers interesting relationships among items in a


transactional database.

III. Frequent Itemset Mining: Identifies itemsets that appear frequently in the data.

IV. The Apriori Algorithm: A key algorithm for mining frequent itemsets and generating
association rules.

V. Mining Multilevel Association Rules: Extracts rules from items at different levels of
abstraction for comprehensive insights.

UNIT ➖04
● Overview of classification, Classification process, Decision tree, Decision Tree
Induction, Attribute Selection Measures. Overview of classifier’s accuracy,


Evaluating classifier’s accuracy, Techniques for accuracy estimation, Increasing
the accuracy of classifier

● Overview of Classification ➖
Classification is a supervised learning technique in data mining and machine learning where the
goal is to assign a label or category to new observations based on the training data. It involves
building a model from a set of labelled examples (training data) and then using that model to
classify new examples.

Classification Process

● Data Collection: ➖
Gather and prepare the data for analysis. This dataset includes both the features (attributes)
and the labels (classes).

● Data Preprocessing: ➖
Clean the data by handling missing values, removing outliers, and normalising the data.
Split the dataset into training and testing sets.

● Feature Selection: ➖
Select relevant features that contribute significantly to the prediction process.

● Model Selection: ➖
Choose an appropriate classification algorithm (e.g., decision tree, SVM, k-NN).

● Model Training: ➖
Train the model using the training dataset.

● Model Evaluation: ➖
Evaluate the model using the testing dataset and various performance metrics.

● Model Deployment: ➖
Deploy the model for classifying new, unseen data.

● Model Monitoring and Maintenance: ➖


Continuously monitor and update the model to ensure its accuracy over time.

● Decision Tree

● Overview of Decision Tree ➖


A decision tree is a flowchart-like structure where each internal node represents a decision
based on an attribute, each branch represents an outcome of the decision, and each leaf node
represents a class label.

● Decision Tree Induction ➖


The process of building a decision tree from the training data. It involves selecting the best
attribute to split the data at each node, recursively partitioning the data, and stopping when a
stopping criterion is met.
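A minimal sketch of decision tree induction using scikit-learn (an assumed library choice, not named in the notes), growing a small tree on the Iris dataset with entropy as the splitting criterion:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Induce the tree: at each node, choose the split with the best information gain
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Inspect the induced decision rules and classify a new flower
print(export_text(tree, feature_names=list(load_iris().feature_names)))
print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))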

● Attribute Selection Measures ➖

Attribute selection measures (also known as splitting criteria) determine how to split the data at
each node in the decision tree. Common measures include:

● Information Gain: ➖
I. Measures the reduction in entropy or uncertainty after splitting the data based on an
attribute.
II. Formula: Information Gain(S, A) = Entropy(S) − Σ over v ∈ Values(A) of (|S_v| / |S|) × Entropy(S_v)
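A short sketch that evaluates this formula directly for a toy split (pure Python; the class labels and the three-way split are invented):

from math import log2
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p * log2(p)) over the class proportions in S
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # Entropy(S) minus the size-weighted entropy of the subsets produced by the split
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# 10 samples; an attribute splits them into three subsets
labels = ["yes"] * 6 + ["no"] * 4
split = [["yes", "yes", "no"], ["yes", "yes", "yes", "no"], ["yes", "no", "no"]]
print(information_gain(labels, split))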
● Techniques for Accuracy Estimation ➖

1. Holdout Method: ➖
H
I. Split the dataset into a training set and a testing set (e.g., 70% training, 30% testing).
II. Train the model on the training set and evaluate it on the testing set.

2. Cross-Validation: ➖
I. K-Fold Cross-Validation: Divide the dataset into k subsets, train the model k times, each time using a different subset as the testing set and the remaining as the training set.
II. Leave-One-Out Cross-Validation: A special case of k-fold cross-validation where k is equal to the number of samples in the dataset.

3. Bootstrap Method: ➖
I. Randomly sample with replacement from the dataset to create multiple training sets and
evaluate the model on the remaining data.
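A hedged sketch of the holdout and k-fold techniques with scikit-learn (an assumed library; the dataset and the k-NN model are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Holdout: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 5-fold cross-validation: each fold serves once as the testing set
scores = cross_val_score(model, X, y, cv=5)
print("cross-validation accuracy:", scores.mean())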
● Increasing the Accuracy of a Classifier ➖

4. Feature Engineering: ➖
I. Create new features from the existing data that can help improve the model's
performance.

5. Feature Selection: ➖
I. Remove irrelevant or redundant features to reduce noise and improve model
performance.

6. Parameter Tuning: ➖
I. Optimise the hyperparameters of the model using techniques like grid search or random
search.

7. Ensemble Methods: ➖
I. Combine multiple models to improve overall performance (e.g., bagging, boosting,
stacking).

8. Data Augmentation: ➖
I. Increase the size of the training data by creating synthetic samples.

9. Regularisation: ➖
I. Add a penalty to the loss function to prevent overfitting (e.g., L1, L2 regularisation).


● Introduction to Clustering, Types of clusters, Clustering methods, Data
visualisation & various data visualisation tools

● Introduction to Clustering ➖
Clustering is an unsupervised machine learning technique used to group a set of objects in such
a way that objects in the same group (cluster) are more similar to each other than to those in
other groups (clusters). It helps in discovering the inherent structure in data without any
predefined labels.

● Types of Clusters ➖

● Exclusive Clusters (Hard Clustering):
I. Each data point belongs to exactly one cluster.
II. Example: K-Means clustering.

● Overlapping Clusters (Soft Clustering): ➖


I. A data point can belong to more than one cluster.
II. Example: Fuzzy C-Means clustering.

● Hierarchical Clusters: ➖
I. Clusters are organised into a tree-like structure.
II. Example: Agglomerative clustering.

● Density-Based Clusters: ➖
I. Clusters are formed based on the density of data points.
II. Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

● Grid-Based Clusters:
I. The data space is divided into a finite number of cells that form a grid structure.
II. Example: STING (Statistical Information Grid).

● Clustering Methods ➖
● Partitioning Methods: ➖
I. Divide the dataset into a set of k clusters.
II. Examples: K-Means, K-Medoids.

● Hierarchical Methods: ➖
I. Build a hierarchy of clusters either through a top-down (divisive) or bottom-up
(agglomerative) approach.

II. Examples: Agglomerative Hierarchical Clustering, DIANA (Divisive Analysis).

● Density-Based Methods: ➖
I. Identify clusters based on the density of data points in the data space.
II. Examples: DBSCAN, OPTICS (Ordering Points To Identify the Clustering Structure).

● Grid-Based Methods: ➖
I. Quantize the data space into a finite number of cells and perform clustering on these
cells.
II. Examples: STING, CLIQUE (Clustering In QUEst).

● Model-Based Methods: ➖
I. Assume a model for each of the clusters and find the best fit of the data to the given
model.
II. Examples: Expectation-Maximization (EM) algorithm, Gaussian Mixture Models (GMM).
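A minimal sketch of a partitioning method (K-Means) with scikit-learn; the library choice and the toy 2-D points are mine, not the notes':

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points that form two visible groups
points = np.array([[1, 1], [1.5, 2], [1, 1.5], [8, 8], [8.5, 9], [9, 8]])

# Partition the points into k = 2 clusters by minimising within-cluster distances
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print("labels:", kmeans.labels_)               # cluster assignment of each point
print("centroids:", kmeans.cluster_centers_)   # centre of each cluster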

● Data Visualization ➖
Data visualisation is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualisation tools provide an accessible way to
see and understand trends, outliers, and patterns in data.

● Various Data Visualization Tools ➖


● Tableau: ➖
I. A powerful, easy-to-use tool for creating interactive and shareable dashboards.
II. Supports a wide variety of charts and maps.

● Power BI: ➖
I. A business analytics service by Microsoft.
II. Provides interactive visualisations and business intelligence capabilities with an interface
simple enough for end users to create their own reports and dashboards.

● Matplotlib:
I. A plotting library for the Python programming language.
II. Provides an object-oriented API for embedding plots into applications.

● Ggplot2: ➖
I. A data visualisation package for the R programming language.
II. Based on the Grammar of Graphics, providing a powerful model for creating complex
plots.

● D3.js: ➖
I. A JavaScript library for producing dynamic, interactive data visualisations in web
browsers.
II. Uses HTML, SVG, and CSS.

● Plotly: ➖
I. An open-source graphing library that makes interactive, publication-quality graphs online.
II. Supports multiple languages including Python, R, and JavaScript.

● QlikView: ➖
I. A business intelligence tool for data visualisation.
II. Allows users to create guided analytics applications and dashboards.
● Google Data Studio: ➖
I. A free tool by Google to create interactive dashboards and reports.
II. Easily integrates with other Google services like Google Analytics and Google Sheets.
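As a small code illustration with one of the tools above (Matplotlib; the monthly sales figures are invented):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

# A simple bar chart summarising a small dataset visually
plt.bar(months, sales, color="steelblue")
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.show()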

Summary ➖
I. Clustering: An unsupervised technique for grouping similar data points into clusters.

II. Types of Clusters: Include exclusive, overlapping, hierarchical, density-based, and


grid-based clusters.

III. Clustering Methods: Partitioning, hierarchical, density-based, grid-based, and


model-based methods.

IV. Data Visualization: The graphical representation of data to understand trends and
patterns.

V. Visualisation Tools: Include Tableau, Power BI, Matplotlib, ggplot2, D3.js, Plotly, QlikView,
and Google Data Studio.

HAPPY ENDING BY: SAHIL RAUNIYAR / PTU-CODER!

SOME PREVIOUS YEARS' QUESTION PAPERS AND THEIR SOLUTIONS OF BCA FIFTH SEMESTER (FROM: PTU).

DWM(5th)May2018.pdf

SECTION A

1. Briefly answer the following:
a) Differentiate between operational and informational data stores.
b) What is multidimensional data? Give two examples.
c) What is OLAM?
d) Define Data Mining.
e) Briefly discuss the Snowflake schema.
f) Discuss Discovery driven cube.
g) What is a Decision Tree?
h) How is the accuracy of a classifier measured?
i) What are the different types of data used in cluster analysis?
j) What are the parameters for selecting and using the right data mining
technique?

1. Briefly answer the following:


a) Differentiate between operational and informational data stores.

1. Operational Data Stores (ODS): ➖


I. Purpose: Used for managing day-to-day operations.
II. Data Update: Frequently updated.
III. Time Span: Contains current data, often short-term.
IV. Example: Transaction processing systems, ERP systems.

2. Informational Data Stores (IDS): ➖


I. Purpose: Used for analysis and decision-making.
II. Data Update: Infrequently updated, often via batch processes.
III. Time Span: Contains historical data, often long-term.
IV. Example: Data warehouses, data marts.



b) What is multidimensional data? Give two examples.
1. Multidimensional Data:
I. Data structured in multiple dimensions, often used in OLAP systems for complex queries
and analysis.
II. Each dimension represents a different aspect of the data.
Examples:
I. Sales data: Dimensions could be time, geography, and product.
II. Financial data: Dimensions could be time, department, and expense type.


c) What is OLAM?
1. OLAM (Online Analytical Mining):
I. Combines Online Analytical Processing (OLAP) with data mining techniques.
II. Allows for the dynamic integration of OLAP with data mining, providing real-time analytical
insights.


d) Define Data Mining.
1. Data Mining:
The process of discovering patterns, correlations, and insights from large datasets using
techniques from statistics, machine learning, and database systems.
Aims to transform raw data into useful information.
e) Briefly discuss the Snowflake schema.

1. Snowflake Schema: ➖
A type of data warehouse schema where the dimension tables are normalised, splitting data into additional tables.
It reduces data redundancy but increases the complexity of queries due to the additional joins.
Example: A dimension table for location split into separate tables for country, state, and city.



f) Discuss Discovery driven cube.
1. Discovery-Driven Cube:
I. A data cube that supports data exploration by automatically highlighting interesting and
significant data points.
II. Integrates visual and interactive analysis to help users discover insights without
predefined queries.
III. Facilitates hypothesis generation and validation by dynamically adjusting views based on
data interactions.


g) What is a Decision Tree?
1. Decision Tree:
A flowchart-like structure used for classification and regression tasks.
Each internal node represents a decision based on an attribute, each branch represents an
outcome of the decision, and each leaf node represents a class label or continuous value.
Example: A decision tree for weather prediction where nodes represent conditions like
temperature, humidity, and wind.
h) How is the accuracy of a classifier measured?
Accuracy of a Classifier:
Measured by the proportion of correctly predicted instances to the total instances.
Formula: Accuracy = Number of Correct Predictions / Total Number of Predictions
Other metrics: Precision, Recall, F1 Score, ROC-AUC.

i) What are the different types of data used in cluster analysis?

Types of Data in Cluster Analysis: ➖


I. Numeric Data: Continuous data, such as age, income.
II. Categorical Data: Discrete data, such as gender, occupation.
III. Binary Data: Data with two possible values, such as yes/no, true/false.
IV. Mixed Data: Combination of numeric, categorical, and binary data.
V. Text Data: Unstructured text data, such as documents or tweets.
VI. Spatial Data: Geographic data, such as coordinates, regions.
VII. Temporal Data: Time-related data, such as timestamps, sequences.

j) What are the parameters for selecting and using the right data mining technique?


Parameters for Selecting Data Mining Technique:
I. Nature of Data: Type, structure, and quality of data (numeric, categorical, binary, etc.).
II. Goal of Analysis: Classification, clustering, regression, association, etc.
III. Data Volume: Size of the dataset and scalability requirements.
IV. Model Interpretability: Need for understandable models vs. black-box models.
V. Accuracy Requirements: Desired level of prediction accuracy.
VI. Computational Resources: Availability of processing power and memory.
VII. Time Constraints: Time available for model training and prediction.
VIII. Tool and Algorithm Availability: Access to specific data mining tools and algorithms.

SECTION B

2. Define Data Warehousing. What is the need for data warehousing? Discuss the
structure of a data warehouse.

3. What do you mean by data pre-processing? Explain the various stages in the process
of data pre-processing.

4. What is OLAP? Discuss the architecture of OLAP in detail.

5. Explain Association Rule Mining. What are the various algorithms for generating
association rules? Discuss with examples.

6. Discuss Bootstrapping, Boosting and Bagging with examples.

7. What is Clustering? Discuss the various clustering algorithms.

Answers ➖

SECTION B

2. Define Data Warehousing. What is the need for data warehousing? Discuss the structure of a data
warehouse.

Data Warehousing: A data warehouse is a centralized repository that stores integrated data from
multiple heterogeneous sources. It is designed to facilitate reporting, analysis, and
decision-making processes by providing a consolidated view of an organization's data.
Need for Data Warehousing:

● Integration of Data: Combines data from different sources into a unified format, making it
easier to analyze and report.
● Historical Analysis: Maintains historical data for trend analysis and long-term business
strategy planning.
● Improved Query Performance: Optimizes query performance through indexing,
summarization, and denormalization.
● Decision Support: Enhances decision-making by providing accurate, consistent, and
comprehensive data.
● Data Quality and Consistency: Ensures data quality, accuracy, and consistency across
the organization.
● Support for OLAP and Data Mining: Facilitates complex queries and data mining
operations.

Structure of a Data Warehouse:

1. Data Sources:
○ Operational databases, external data sources, flat files, etc.
○ ETL (Extract, Transform, Load) process extracts data from these sources.
2. ETL Process:
○ Extraction: Data is extracted from various sources.
○ Transformation: Data is cleaned, transformed, and standardized.
○ Loading: Transformed data is loaded into the data warehouse.
3. Data Staging Area:
○ Temporary storage area where data is processed and transformed before loading
into the data warehouse.
4. Data Storage:
○ Centralized Repository: Stores integrated and transformed data.
○ Data Marts: Subsets of the data warehouse tailored for specific departments or
business areas.
○ Metadata Repository: Stores information about data sources, data
transformations, data schemas, etc.
5. Data Access Layer:
○ Provides tools and interfaces for querying, reporting, and analyzing the data.
○ Includes OLAP tools, data mining tools, and business intelligence tools.
6. Front-End Tools:
○ Dashboards, reporting tools, and data visualization tools used by end-users to
interact with the data warehouse.

3. What do you mean by data pre-processing? Explain the various stages in the process
of data pre-processing.

Data Pre-Processing: Data pre-processing is the process of transforming raw data into a clean
and usable format, preparing it for further analysis or modeling. It is a crucial step to ensure data
quality and improve the performance of data mining algorithms.

Stages in Data Pre-Processing:

1. Data Cleaning:

○ Handling Missing Values: Imputation, deletion, or estimation of missing data.


○ Noise Reduction: Smoothing techniques, binning, regression, clustering to reduce
noise.
○ Outlier Detection and Removal: Identifying and removing outliers that can skew
analysis.
2. Data Integration:
○ Combining Data: Merging data from multiple sources into a single coherent
dataset.
○ Schema Integration: Resolving schema conflicts and ensuring consistency in data
formats.
3. Data Transformation:
○ Normalization: Scaling data to a standard range (e.g., min-max normalization,
z-score normalization).
○ Discretization: Converting continuous data into discrete intervals.
○ Aggregation: Summarizing data, often by grouping or rolling up data along
dimensions.
○ Encoding: Converting categorical data into numerical formats (e.g., one-hot
encoding).
4. Data Reduction:
○ Feature Selection: Identifying and retaining the most relevant features.
○ Feature Extraction: Creating new features from existing ones (e.g., PCA).
○ Sampling: Reducing the size of the dataset by selecting a representative subset.
5. Data Discretization and Binning:
○ Discretization: Transforming continuous attributes into categorical ones.
○ Binning: Grouping continuous data into bins or intervals.

4. What is OLAP? Discuss the architecture of OLAP in detail.

OLAP (Online Analytical Processing): OLAP is a category of software tools that provides
analysis of data stored in a database. It enables users to perform complex queries and analysis,
often involving large amounts of data, with the purpose of discovering insights and making
business decisions.

Architecture of OLAP:

1. OLAP Cube:

○ A multi-dimensional array of data, which allows data to be modeled and viewed in
multiple dimensions.
2. OLAP Servers:
○ ROLAP (Relational OLAP): Stores data in relational databases and uses complex
SQL queries for data analysis.
○ MOLAP (Multidimensional OLAP): Stores data in multi-dimensional cube
structures and pre-computes aggregates for fast querying.
○ HOLAP (Hybrid OLAP): Combines ROLAP and MOLAP, storing detailed data in relational databases and aggregates in multi-dimensional cubes.
3. Data Sources:

○ Includes operational databases, data warehouses, and external data sources from
which data is extracted.
4. ETL Process:
○ Extracts data from various sources, transforms it into a suitable format, and loads it
into OLAP servers.
5. OLAP Engine:
○ The core processing engine that handles complex queries, performs computations,
and retrieves data from OLAP cubes or relational databases.
6. OLAP Tools:
○ Query and Reporting Tools: Allow users to create and execute queries, generate
reports, and visualize data.
○ Analysis Tools: Provide functionalities for drill-down, roll-up, slicing, dicing, and
pivoting.
7. User Interface:
○ The front-end interface through which users interact with the OLAP system, often
providing graphical and interactive capabilities.

5. Explain Association Rule Mining. What are the various algorithms for generating
association rules? Discuss with examples.

Association Rule Mining: Association rule mining is a technique in data mining that discovers
interesting relationships, patterns, and associations among a set of items in large datasets,
typically transactional databases.
Key Concepts:

● Support: Indicates how frequently an itemset appears in the dataset.


● Confidence: Measures the likelihood that a rule is correct.
● Lift: Measures how much more likely the rule is to occur than random chance.

Algorithms for Generating Association Rules:

1. Apriori Algorithm:
○ Generates candidate itemsets and prunes infrequent ones using support threshold.
○ Example: In a supermarket dataset, finding that customers who buy bread also buy
butter with a certain support and confidence.
2. FP-Growth (Frequent Pattern Growth):
○ Uses a divide-and-conquer strategy and constructs an FP-tree to find frequent
itemsets without candidate generation.
○ Example: Identifying frequently co-purchased items in an e-commerce platform
using a compact FP-tree structure.
3. Eclat Algorithm:
○ Uses a depth-first search approach and vertical data format (transaction IDs) for
discovering frequent itemsets.
○ Example: Mining frequent itemsets in text data where items are words and
transactions are documents.
6. Discuss Bootstrapping, Boosting and Bagging with examples.

Bootstrapping:

● A statistical resampling method used to estimate the distribution of a statistic by sampling


with replacement from the original dataset.
● Example: Estimating the confidence interval of the mean of a sample by repeatedly
resampling and calculating the mean of each resample.

Boosting:

● An ensemble learning technique that combines multiple weak classifiers to create a


strong classifier.
● Example: AdaBoost algorithm, which adjusts the weights of incorrectly classified
instances and combines the predictions of weak classifiers to improve accuracy.

Bagging (Bootstrap Aggregating):

● An ensemble method that trains multiple models on different bootstrap samples and
aggregates their predictions.
● Example: Random Forest, which builds multiple decision trees on different subsets of the
data and averages their predictions.

7. What is Clustering? Discuss the various clustering algorithms.


Clustering: Clustering is an unsupervised learning technique that groups a set of objects in
such a way that objects in the same group (cluster) are more similar to each other than to those
in other groups.

Various Clustering Algorithms:

1. K-Means Clustering:
○ Partitions the data into K clusters by minimizing the sum of squared distances
between data points and the cluster centroids.
○ Example: Segmenting customers based on purchasing behavior.
2. Hierarchical Clustering:
○ Builds a hierarchy of clusters either through a bottom-up (agglomerative) or
top-down (divisive) approach.

○ Example: Creating a dendrogram to visualize the hierarchical relationships
between clusters.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
○ Forms clusters based on the density of data points, identifying core points, border
points, and noise.
○ Example: Identifying clusters in geographical data where clusters are regions of
high data density.
4. Gaussian Mixture Models (GMM):
○ Assumes that the data is generated from a mixture of several Gaussian
distributions and uses the EM algorithm to estimate the parameters.
○ Example: Clustering data points that follow a normal distribution into multiple
Gaussian clusters.
5. Agglomerative Hierarchical Clustering:
○ Recursively merges the closest pair of clusters until all points are in a single cluster.
○ Example: Clustering gene expression data to find groups of genes with similar expression patterns.

DWM(5th)May2019.pdf

SECTION-A
Q1. Answer briefly :
a) What are operational and informational data stores?
b) What is OLAP data warehouse?
c) Write about the importance of Artificial Intelligence.
d) What are data cubes?
e) What is data summarization?
f) How does decision tree work?

g) Write about market based analysis.
h) What is the accuracy of classifier?
i) What is cross validation?
j) Why do we use data visualization?

Answers ➖
SECTION A

Q1. Answer briefly:

a) What are operational and informational data stores?

● Operational Data Stores (ODS):


SA
○ Purpose: Used for managing day-to-day operations.
○ Data Update: Frequently updated.
○ Time Span: Contains current, short-term data.
○ Example: Transaction processing systems, ERP systems.
● Informational Data Stores (IDS):
○ Purpose: Used for analysis and decision-making.
○ Data Update: Infrequently updated, often via batch processes.
○ Time Span: Contains historical, long-term data.
○ Example: Data warehouses, data marts.

b) What is OLAP data warehouse?

● An OLAP (Online Analytical Processing) Data Warehouse is a type of data warehouse optimized for complex queries and analysis. It allows users to perform multidimensional
analysis on large volumes of data, supporting operations like slicing, dicing, drill-down,
and roll-up. OLAP systems enable fast and interactive querying, often through the use of
multidimensional data structures called data cubes.

c) Write about the importance of Artificial Intelligence.


● Artificial Intelligence (AI):
○ Automation: AI enables automation of repetitive tasks, increasing efficiency and
reducing human error.
○ Decision Making: AI systems analyze vast amounts of data to provide insights and
support decision-making processes.
○ Personalization: AI tailors experiences and services to individual preferences,
enhancing user engagement.
○ Problem Solving: AI can solve complex problems that are difficult for humans to
handle, such as image and speech recognition.
○ Innovation: AI drives innovation by enabling the development of new technologies
and applications across various industries.

d) What are data cubes?

IL
● Data Cubes:
○ Multidimensional arrays of data used in OLAP systems.
○ Organize data into dimensions (e.g., time, location, product) and measures (e.g.,
sales, profit).
○ Facilitate complex queries and data analysis by providing a structured and intuitive
way to navigate and summarize large datasets.
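As a rough, hedged illustration of the idea (not a tool mentioned in the notes), a pandas pivot table behaves like a tiny two-dimensional data cube with time and product as dimensions and sales as the measure; the figures below are invented:

import pandas as pd

# Invented sales records: two dimensions (quarter, product), one measure (sales)
df = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "product": ["Bread", "Butter", "Bread", "Butter"],
    "sales":   [100, 80, 120, 90],
})

# Roll the measure up along both dimensions, like a 2-D slice of a cube
cube = df.pivot_table(values="sales", index="quarter",
                      columns="product", aggfunc="sum")
print(cube)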

e) What is data summarization?


H
● Data Summarization:
○ The process of reducing detailed data into a compact and comprehensive form.
○ Techniques include aggregation (e.g., summing sales over time periods),
descriptive statistics (e.g., mean, median), and data visualization (e.g., charts,
graphs).
○ Helps in understanding and interpreting large datasets by highlighting key patterns and trends.

f) How does decision tree work?

● Decision Tree:
○ A flowchart-like structure used for classification and regression tasks.
○ Nodes: Represent decisions or tests based on an attribute.
○ Branches: Represent outcomes of the decision or test.
○ Leaf Nodes: Represent class labels or continuous values.
○ The tree is built by recursively splitting the dataset based on attribute values that
maximize the separation of classes (using measures like Gini impurity, entropy, or
information gain).
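A minimal decision tree sketch with scikit-learn; the four training rows, the feature names, and the query record are invented:

from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny invented dataset: [age, income] -> buys (1) / does not buy (0)
X = [[25, 30_000], [45, 80_000], [35, 60_000], [22, 20_000]]
y = [0, 1, 1, 0]

# Splits are chosen to maximize class separation (Gini impurity by default)
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))  # show the learned rules
print(tree.predict([[40, 70_000]]))                        # class for a new record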

g) Write about market based analysis.

● Market-Based Analysis:
○ Also known as market basket analysis, it is a data mining technique used to
understand the purchase behavior of customers.
○ Identifies patterns and associations between items purchased together.
○ Example: Discovering that customers who buy bread also frequently buy butter.
○ Utilized in retail for cross-selling, product placement, and inventory management.

h) What is the accuracy of classifier?

● Accuracy of a Classifier:
○ The proportion of correctly predicted instances to the total instances.
○ Formula: Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
○ Accuracy is a measure of the classifier's performance, indicating how well it
predicts the correct labels.
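The same formula in a few lines of Python, using invented true and predicted labels:

# Invented ground-truth and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)   # correct predictions / total predictions
print(accuracy)                    # 0.75 for these labels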

IL
i) What is cross validation?

● Cross Validation:
○ A technique for evaluating the performance of a model by partitioning the data into
training and testing sets multiple times.
○ K-Fold Cross Validation: The dataset is divided into k subsets (folds). The model
is trained on k-1 folds and tested on the remaining fold. This process is repeated k
times, with each fold used as the test set once.
H
○ Helps in assessing the model's generalizability and reducing overfitting.
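A k-fold cross-validation sketch with scikit-learn; the Iris dataset, the decision tree model, and the 5-fold split are just one possible setup:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, test on the held-out fold, repeat 5 times
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores)          # accuracy on each fold
print(scores.mean())   # average accuracy across folds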

j) Why do we use data visualization?

● Data Visualization:
○ Transforms data into graphical representations, such as charts, graphs, and maps.
○ Importance:
SA
■ Insight Discovery: Reveals patterns, trends, and correlations that may not
be apparent in raw data.
■ Communication: Simplifies complex data, making it easier to understand
and communicate findings.
■ Decision Making: Supports data-driven decision making by presenting
information clearly and intuitively.
■ Exploration: Allows interactive exploration of data, enabling users to drill
down into details and uncover insights.
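A simple bar chart with matplotlib to illustrate the point; the product names and sales figures are invented:

import matplotlib.pyplot as plt

categories = ["Bread", "Butter", "Milk", "Jam"]   # invented product categories
sales = [120, 90, 75, 40]                          # invented sales figures

plt.bar(categories, sales)
plt.title("Sales by product")
plt.xlabel("Product")
plt.ylabel("Units sold")
plt.show()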

SECTION-B

Q2 How much does a data warehouse cost? Write their applications and uses.
Q3 Discuss the steps of building data warehouse by considering various technical
aspects.
Q4 What is multidimensional data model? Discuss the schemas for multidimensional
data.
Q5 What is association rule mining? Explain Apriori algorithm in data mining.
Q6 Define clustering? Why clustering is important in Data Mining? Write its uses.
Q7 What are different types of Data Mining Techniques? Explain any one in detail.

Answers ➖

IL
SECTION B

Q2. How much does a data warehouse cost? Write their applications and uses.

Cost of a Data Warehouse:

● The cost of a data warehouse can vary significantly based on several factors:
○ Infrastructure: Hardware costs for servers, storage devices, and networking equipment.

○ Software: Costs for database management systems (DBMS), ETL tools, and
analytics software.
○ Development: Costs associated with data modeling, ETL development, and
integration.
○ Maintenance: Ongoing costs for data cleansing, monitoring, and support.
SA
Applications and Uses:

● Business Intelligence: Provides a consolidated view of data for reporting and analytics.
● Decision Support: Supports strategic decision-making based on historical and real-time
data.
● Predictive Analytics: Enables forecasting and predictive modeling using advanced
analytics techniques.
● Operational Efficiency: Improves efficiency by streamlining data access and analysis.
● Customer Insights: Helps in understanding customer behavior, preferences, and trends.
● Regulatory Compliance: Facilitates compliance reporting and audit trails.
● Risk Management: Identifies and mitigates risks through data-driven insights.
● Marketing and Sales: Enhances targeting, campaign effectiveness, and sales
forecasting.

Q3. Discuss the steps of building a data warehouse by considering various technical aspects.

Steps of Building a Data Warehouse:

1. Requirement Gathering and Analysis:


○ Define business requirements and objectives for the data warehouse.
○ Identify key stakeholders and user needs.
2. Data Source Identification:
○ Identify and analyze data sources (operational databases, external systems, flat
files).
○ Evaluate data quality and suitability for the data warehouse.
3. Data Modeling:
○ Design the conceptual, logical, and physical data models.
○ Define entities, attributes, relationships, and data granularity.
4. ETL (Extract, Transform, Load) Process:
○ Extract data from source systems using ETL tools.
○ Transform data to conform to the data warehouse schema (cleaning, filtering,
integrating).

IL
○ Load transformed data into the data warehouse.
5. Data Storage and Organization:
○ Choose appropriate storage structures (tables, indexes, partitions).
○ Implement data partitioning, clustering, and indexing for performance optimization.
6. Metadata Management:
○ Develop and maintain metadata repository.
○ Document data lineage, transformations, and data definitions.
7. OLAP Cube Design:
○ Design and build OLAP cubes for multidimensional analysis.
○ Define dimensions, measures, hierarchies, and aggregation levels.
8. Implementation and Testing:
○ Deploy data warehouse components (database, ETL processes, OLAP cubes).
○ Conduct unit testing, integration testing, and performance testing.
9. Deployment and Maintenance:
SA
○ Deploy the data warehouse to production environment.
○ Establish monitoring and maintenance processes for data quality, performance
tuning, and backup/recovery.
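A highly simplified ETL sketch in Python with pandas and SQLite; the file name, column names, cleaning rules, and target table are hypothetical placeholders, not a prescribed design:

import pandas as pd
import sqlite3

# Extract: read raw records from a hypothetical source file
raw = pd.read_csv("orders_source.csv")                   # hypothetical file name

# Transform: clean and conform the data to the warehouse schema
raw["order_date"] = pd.to_datetime(raw["order_date"])    # hypothetical column
clean = raw.dropna(subset=["customer_id"])               # drop incomplete rows
clean["amount"] = clean["amount"].round(2)

# Load: append the transformed rows into a warehouse fact table
conn = sqlite3.connect("warehouse.db")                   # hypothetical target
clean.to_sql("fact_orders", conn, if_exists="append", index=False)
conn.close()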

Q4. What is a multidimensional data model? Discuss the schemas for multidimensional data.

Multidimensional Data Model:

● A multidimensional data model organizes data into multiple dimensions, providing a structured view to analyze and explore data along different perspectives.

Schemas for Multidimensional Data:

1. Star Schema:
○ Central fact table surrounded by dimension tables.
○ Simple and denormalized structure, suitable for querying and reporting.
○ Example: Sales fact table linked to product, time, and location dimensions.
2. Snowflake Schema:
○ Extends the star schema by normalizing dimension tables.
○ Reduces data redundancy but increases complexity due to additional joins.
○ Example: Product dimension further normalized into product category and
subcategory tables.
3. Fact Constellation (Galaxy) Schema:
○ Multiple fact tables share dimension tables.
○ Suitable for complex business processes with multiple interrelated fact tables.
○ Example: Separate fact tables for sales and inventory linked to common product
and time dimensions.

Q5. What is association rule mining? Explain the Apriori algorithm in data mining.

Association Rule Mining:

● Association rule mining is a data mining technique used to discover interesting relationships and associations between items in large datasets, typically transactional databases.

Apriori Algorithm:

● Algorithm Steps:
○ Generate Candidate Itemsets: Identify frequent items (single items) and combine
them to form candidate itemsets.
○ Calculate Support: Count the occurrences of each candidate itemset in the
dataset.
H
○ Prune Non-Frequent Itemsets: Remove candidate itemsets that do not meet the
minimum support threshold.
○ Generate Association Rules: From frequent itemsets, generate rules with a
minimum confidence threshold.
● Example: In a supermarket dataset:
○ Support: Percentage of transactions containing a specific itemset.
SA
○ Confidence: Likelihood of one item being purchased given the purchase of another
item.
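Computing support and confidence for a single rule (bread -> butter) over toy transactions; the data and resulting numbers are illustrative only:

# Invented transactions
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"butter", "milk"},
]

def support(itemset):
    # Fraction of transactions containing the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: bread -> butter
sup_rule = support({"bread", "butter"})      # P(bread and butter together)
conf_rule = sup_rule / support({"bread"})    # P(butter | bread)
print(f"support = {sup_rule:.2f}, confidence = {conf_rule:.2f}")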

Q6. Define clustering? Why clustering is important in Data Mining? Write its uses.

Clustering:

● Clustering is an unsupervised learning technique that groups a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups.

Importance of Clustering in Data Mining:

● Pattern Recognition: Identifies inherent patterns and structures in data.


● Data Understanding: Provides insights into data distribution and relationships.
● Anomaly Detection: Identifies outliers or unusual data points.
● Data Compression: Reduces data dimensionality for easier analysis.
● Segmentation: Segments data into meaningful groups for targeted analysis.
Uses of Clustering:

● Market Segmentation: Grouping customers based on buying behavior.


● Image Segmentation: Identifying regions of interest in medical or satellite images.
● Document Clustering: Organizing documents based on content similarity.
● Anomaly Detection: Identifying unusual patterns in network traffic or financial
transactions.
● Genetics: Clustering genes based on expression patterns for biological research.

Q7. What are different types of Data Mining Techniques? Explain any one in detail.

Different Types of Data Mining Techniques:

1. Classification: Predicts categorical labels or classifies data into predefined classes.

IL
2. Regression: Predicts continuous values or numeric outcomes.
3. Clustering: Groups similar data points into clusters based on similarity.
4. Association Rule Mining: Finds interesting relationships between variables in large
datasets.
5. Anomaly Detection: Identifies outliers or unusual patterns in data.
6. Sequential Pattern Mining: Discovers sequential patterns or trends in data sequences.
7. Text Mining: Extracts meaningful patterns and relationships from unstructured text data.

Explanation - Classification:
H
● Definition: Classification is a supervised learning technique used to predict categorical
labels or class memberships for new data points based on past observations.
● Process:
○ Training Phase: Learn patterns and relationships from labeled training data using
algorithms like Decision Trees, Naive Bayes, or Support Vector Machines.
SA
○ Prediction Phase: Apply the learned model to new data to predict class labels.
● Example:
○ Application: Email Spam Detection.
○ Process: Train a classifier using labelled emails (spam or not spam). The classifier
learns patterns (e.g., keywords, sender) associated with spam emails. When a new
email arrives, the classifier predicts whether it is spam or not based on learned
patterns.
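A toy spam classifier following the process above, sketched with scikit-learn's CountVectorizer and a Naive Bayes model; the example emails and labels are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented labelled training emails (1 = spam, 0 = not spam)
emails = ["win a free prize now", "meeting agenda for monday",
          "free money claim now", "project report attached"]
labels = [1, 0, 1, 0]

# Training phase: learn word patterns associated with each class
vec = CountVectorizer()
X = vec.fit_transform(emails)
clf = MultinomialNB().fit(X, labels)

# Prediction phase: classify a new, unseen email
new_email = ["claim your free prize"]
print(clf.predict(vec.transform(new_email)))   # expected to predict spam (1)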
