THEORY FILE - Data Warehouse and Mining (5th Sem)
Prof. Sahil Kumar
Program ➖ BCA
UNIT ➖ 01
● Need for strategic information, difference between operational and Informational
data stores
Strategic Information ➖
Strategic information is vital for organisations as it supports long-term planning and
decision-making processes. It helps executives and senior managers to understand the
competitive landscape, market trends, and internal performance metrics, thereby guiding the
organisation towards achieving its strategic goals.
➖
Difference Between Operational and Informational Data Stores in Data Warehousing and
Data Mining
Data Warehousing involves the storage, retrieval, and management of large volumes of data
from different sources to support business analysis and reporting. Data Mining involves
analysing large datasets to find patterns, correlations, and insights for decision-making.
Operational Data Stores ➖
Characteristics: ➖
I. Real-Time Data: Provides up-to-date information.
II. Frequent Updates: Data is continuously updated as transactions occur.
III. Transaction-Oriented: Focuses on current operations and transactions.
IV. Normalisation: Data is highly normalised to eliminate redundancy and ensure data
integrity.
Uses: ➖
I. Order Processing: Managing customer orders.
II. Inventory Management: Tracking stock levels.
III. Customer Relationship Management (CRM): Managing customer interactions.
Informational Data Stores ➖
Characteristics: ➖
I. Historical Data: Contains historical data for trend analysis.
II. Periodic Updates: Data is updated periodically (e.g., daily, weekly).
III. Analysis-Oriented: Supports complex queries and data analysis.
IV. De-normalization: Data is often de-normalized to optimise query performance.
Uses:
I. Business Intelligence: Generating reports and dashboards.
II. Data Mining: Discovering patterns and insights.
III. Strategic Planning: Supporting long-term decision-making.
Conclusion
Understanding the differences between operational and informational data stores is crucial for
effectively leveraging data warehousing and mining techniques. While operational data stores
support the daily transactional processes, informational data stores provide the historical and
comprehensive view necessary for strategic decision-making and analysis.
➖
Approaches to build a data warehouse, Building a data warehouse, Metadata & its
types
➖
➖
Role and Structure of a Data Warehouse
Role:
I. Decision Support: Facilitates informed decision-making by providing comprehensive data
analysis.
II. Business Intelligence: Supports reporting, online analytical processing (OLAP), and data
mining.
III. Data Consolidation: Integrates disparate data sources into a unified view.
Structure: ➖
I. Source Systems: Operational databases, external data sources.
II. Staging Area: Intermediate storage for ETL (Extract, Transform, Load) processes.
III. Data Storage: Central repository (data warehouse) organised in schemas like star,
snowflake.
IV. Data Presentation: Access layer for reporting and analysis tools.
OLAP Operations ➖
I. Roll-Up: Aggregating data along a dimension.
II. Drill-Down: Breaking data into finer levels of detail.
III. Slice: Selecting a single dimension to create a subset.
IV. Dice: Creating a subcube by selecting multiple dimensions.
V. Pivot: Rotating data axes to view from different perspectives.
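To make these operations concrete, here is a minimal Python/pandas sketch (the small sales table is made up for illustration) showing roll-up, slice, and pivot on a flat fact table:

# A minimal sketch of roll-up, slice, and pivot using pandas (hypothetical sales data).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "region":  ["North", "South", "North", "South", "North", "South"],
    "amount":  [100, 150, 120, 130, 170, 160],
})

# Roll-up: aggregate from (year, quarter, region) detail up to the year level.
rollup = sales.groupby("year")["amount"].sum()

# Slice: fix one dimension (region = "North") to obtain a subset of the cube.
north_slice = sales[sales["region"] == "North"]

# Pivot: view amount by year (rows) and region (columns).
pivot = sales.pivot_table(values="amount", index="year", columns="region", aggfunc="sum")
print(rollup, north_slice, pivot, sep="\n")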
Data Mart ➖
A data mart is a subset of a data warehouse, focused on a specific business area or
department. It is designed for a particular group of users, offering tailored data and faster query
performance.
Differences Between Data Mart and Data Warehouse
Scope:
Data Mart: Specific to a business area.
Data Warehouse: Enterprise-wide.
Size:
Data Mart: Smaller, more focused datasets.
Data Warehouse: Larger, comprehensive datasets.
Users:
Data Mart: Specific departmental users.
Data Warehouse: Broad range of users across the organization.
Implementation Time:
Data Mart: Quicker to implement.
Data Warehouse: Takes longer to build due to broader scope.
Top-Down Approach: ➖
I. Starts with designing the overall data warehouse.
II. Later builds data marts from the centralised data warehouse.
➖
Bottom-Up Approach:
I. Begins with building data marts for specific business areas.
II. Integrates these data marts into an enterprise-wide data warehouse over time.
Hybrid Approach: ➖
I. Combines elements of both top-down and bottom-up approaches.
Building a Data Warehouse ➖
Requirements Analysis: ➖
I. Understand business needs and data requirements.
Data Modelling:
I. Design conceptual, logical, and physical data models.
ETL Process:
Extract data from source systems, transform it into a suitable format, and load it into the data
warehouse.
Implementation:
I. Set up the data warehouse environment and load data.
Metadata & Its Types ➖
Metadata is data about data: it describes the content, structure, and operations of the warehouse. Its main types are:
Business Metadata: ➖
I. Describes the data from a business perspective (e.g., definitions, business rules).
Technical Metadata:
I. Describes technical aspects (e.g., data types, ETL processes, table structures).
Operational Metadata:
I. Contains information about system operations (e.g., data lineage, ETL job logs, error
reports).
Process Metadata:
Details about the processes that manipulate data (e.g., ETL transformations, scheduling
information).
● Data warehouse definition, characteristics, Data warehouse role and structure, OLAP Operations, Data mart, Difference between data mart and data warehouse, Approaches to build a data warehouse, Building a data warehouse, Metadata & its types
● Data Warehouse Definition & Characteristics ➖
A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data that supports management decision-making. Its key characteristics are:
I. Subject-Oriented: Organised around major subjects such as customers, products, and sales rather than specific business processes.
II. Integrated: Consolidates data from various sources into a consistent format, with uniform
naming conventions, measurements, and encoding structures.
III. Non-Volatile: Once data is entered into the data warehouse, it is not updated or deleted.
This ensures a stable historical record.
IV. Time-Variant: Contains historical data to track changes over time, often with time stamps.
➖
Structure of a Data Warehouse
I. Data Sources: Various operational systems, external data sources, and other data
repositories.
II. ETL Processes: Extract, Transform, Load processes that gather data from sources, clean
and transform it, and load it into the warehouse.
III. Data Storage: Centralised storage area where data is organised into schemas like star or
snowflake schemas.
IV. Metadata: Data about the data, including definitions, mappings, and data lineage.
V. Access Tools: Query and reporting tools, OLAP (Online Analytical Processing) tools, and
data mining tools.
OLAP Operations ➖
OLAP tools allow users to interactively analyse multidimensional data from multiple
perspectives.
➖
Common OLAP operations include:
I. Slice: Isolating a single layer of a data cube.
II. Dice: Creating a sub-cube by selecting specific values for multiple dimensions.
III. Drill Down/Up: Navigating through the levels of data, from summary to detailed data and
vice versa.
IV. Pivot (Rotate): Reorienting the multidimensional view of data to look at it from different
perspectives.
Data Mart ➖
A data mart is a subset of a data warehouse focused on a specific business area, department,
or subject. It provides a more limited, subject-specific view of data and is often easier and faster
to implement than a full data warehouse.
Differences Between Data Mart and Data Warehouse ➖
I. Scope: A data warehouse is enterprise-wide, whereas a data mart is limited to a specific business area or department.
II. Data Integration: Data warehouses integrate data from multiple sources, whereas data marts may focus on a single source or a few sources.
III. Size: Data warehouses are typically much larger in size than data marts.
IV. Implementation Time: Data marts can be implemented more quickly compared to data warehouses.
➖
Approaches to Build a Data Warehouse
I. Top-Down Approach: Start with a comprehensive data warehouse design and
implementation, followed by the creation of data marts.
II. Bottom-Up Approach: Start with the implementation of data marts, which are then
integrated into a comprehensive data warehouse.
III. Hybrid Approach: Combines elements of both top-down and bottom-up approaches,
implementing core components of a data warehouse along with subject-specific data
marts.
Building a Data Warehouse ➖
I. Requirements Analysis: Gather business requirements and identify the data needed to support them.
II. Data Modeling: Design the data warehouse schema, choosing between star, snowflake,
or galaxy schemas.
III. ETL Development: Create ETL processes to extract data from source systems, transform
it, and load it into the warehouse.
IV. Data Loading: Load the historical data into the warehouse.
V. Metadata Management: Develop and maintain metadata to ensure data consistency and
transparency.
VI. Deployment: Implement the data warehouse and make it accessible to end-users.
VII. Maintenance and Evolution: Regularly update and maintain the warehouse to
accommodate changing business needs.
Types of metadata include: ➖
I. Technical Metadata: Details about data sources, data structures, data transformations,
and storage (e.g., schema definitions, data lineage).
II. Business Metadata: Describes the meaning and context of data from a business
perspective (e.g., business terms, data ownership).
III. Operational Metadata: Information about the operations performed on the data, including ETL
processes and data quality metrics (e.g., job logs, error logs).
UNIT ➖ 02
● Data Pre-processing: Need, Data Summarization, Methods. Denormalization,
Multidimensional data model, Schemas for multidimensional data (Star schema,
Snowflake Schema, Fact Constellation Schema, Difference between different
schemas) .
Data Pre-processing ➖
● Need for Data Pre-processing
Data pre-processing is a crucial step in the data mining process because real-world data is
often incomplete, noisy, and inconsistent. Pre-processing transforms raw data into an
understandable format.
I. Improving Data Quality: Corrects errors, fills in missing values, and removes inconsistencies.
II. Enhancing Analysis: Makes data suitable for analysis, ensuring accurate and reliable results.
III. Data Integration: Combines data from multiple sources to provide a unified view.
IV. Reducing Complexity: Simplifies data to reduce computational requirements and improve performance.
➖
● Data Summarization
Data summarization involves creating compact representations of data sets. Common methods include:
I. Summary Statistics: Measures such as mean, median, mode, and standard deviation.
II. Data Cube Aggregation: Aggregating data across multiple dimensions to create summary
data cubes.
III. Data Visualization: Graphical representations like histograms, pie charts, and scatter plots
to summarise data visually.
IV. Data Reduction: Techniques like dimensionality reduction and data compression to create
smaller representations of the data.
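A minimal pandas sketch of data summarization (hypothetical numeric data; describe() reports count, mean, standard deviation, minimum, quartiles, and maximum):

# A minimal sketch of data summarization with pandas (assumed small numeric dataset).
import pandas as pd

data = pd.DataFrame({
    "age":    [23, 35, 31, 45, 29],
    "income": [30000, 52000, 48000, 61000, 39000],
})

print(data.describe())               # summary statistics for each numeric column
print(data["age"].value_counts())    # frequency summary of a single attribute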
➖
● Methods of Data Pre-processing
I. Data Cleaning: Handling missing values, removing noise, correcting inconsistencies.
A. Techniques: Imputation, smoothing, interpolation, and outlier detection.
II. Data Integration: Combining data from multiple sources to create a coherent data set.
B. Techniques: Schema integration, entity identification, and redundancy elimination.
III. Data Transformation: Converting data into forms suitable for mining.
C. Techniques: Normalisation, aggregation, generalisation, and discretisation.
IV. Data Reduction: Reducing the volume of data while maintaining its integrity.
D. Techniques: Principal Component Analysis (PCA), sampling, clustering, and aggregation.
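A short pandas sketch of typical cleaning steps (the small table and the age threshold below are assumptions for illustration):

# A minimal sketch of common data-cleaning steps with pandas (hypothetical data).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 200, 31],
    "city": ["Delhi", "Delhi", None, "Pune", "Pune"],
})

df["age"] = df["age"].fillna(df["age"].median())        # imputation of missing values
df.loc[df["age"] > 120, "age"] = df["age"].median()     # simple outlier correction
df["city"] = df["city"].fillna("Unknown")               # fill missing categorical values
df = df.drop_duplicates()                               # redundancy elimination
print(df)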
● Denormalization
Denormalization is the process of combining normalised tables into fewer tables to improve
query performance. It introduces redundancy to reduce the number of joins required in queries,
which can significantly speed up data retrieval. Denormalization is commonly used in OLAP
systems to enhance performance.
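A minimal pandas sketch of denormalization, pre-joining a hypothetical product dimension into a fact table so that later queries need no join:

# A minimal sketch of denormalization: pre-joining a dimension table into the fact table
# (hypothetical tables) so that queries avoid joins at run time.
import pandas as pd

fact_sales = pd.DataFrame({"product_id": [1, 2, 1], "amount": [100, 200, 150]})
dim_product = pd.DataFrame({
    "product_id":   [1, 2],
    "product_name": ["Pen", "Book"],
    "category":     ["Stationery", "Stationery"],
})

denormalized = fact_sales.merge(dim_product, on="product_id", how="left")
print(denormalized)   # product columns are now redundant, but no join is needed when querying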
● Multidimensional Data Model ➖
The multidimensional data model organises data around facts (numeric measures) and dimensions. It allows for complex queries and analysis across different dimensions (e.g., time, geography, product).
I. Facts: Numeric measures of interest, such as sales amount or units sold.
II. Dimensions: Perspectives along which the facts are analysed, such as time, geography, and product.
III. Data Cubes: Multidimensional arrays of values, used to represent data along different dimensions.
● Star Schema ➖
I. Structure: A central fact table connected directly to denormalised dimension tables.
II. Characteristics: Simple structure, fewer joins, fast query performance.
III. Fact Table: Contains measures and foreign keys to dimension tables.
IV. Dimension Tables: Denormalised, with one table per dimension.
● Snowflake Schema ➖
I. Structure: Similar to star schema but with normalised dimension tables.
II. Characteristics: Reduces redundancy, more complex queries, slower performance
compared to star schema.
III. Fact Table: Contains measures and foreign keys to dimension tables.
IV. Dimension Tables: Can be normalised into multiple related tables.
● Fact Constellation (Galaxy) Schema ➖
I. Structure: Multiple fact tables share common dimension tables.
II. Characteristics: Suitable for complex applications, can represent multiple business
processes.
III. Fact Tables: Multiple fact tables related to different processes or subjects.
● Differences Between Schemas
I. Complexity: Star schema is the simplest, followed by snowflake schema, and then fact
constellation schema.
II. Performance: Star schema usually offers the best performance due to fewer joins, while
snowflake schema can be slower due to normalisation. Fact constellations can be
complex to query.
III. Redundancy: Snowflake schema reduces redundancy compared to star schema by
normalising dimension tables. Fact constellation shares dimensions across fact tables to
reduce redundancy.
IV. Use Cases: Star schema is often used for simple queries and reporting, snowflake
schema for data warehousing with normalised data, and fact constellation for complex
analytical applications involving multiple business processes.
➖
● Data warehouse architecture, OLAP servers, Indexing OLAP Data, OLAP query
processing, Data cube computation
● Data Warehouse Architecture ➖
Data Storage Layer: ➖
I. Central Data Warehouse: The main repository that stores integrated, historical data.
II. Metadata Repository: Stores metadata about the data, including schema definitions, data
lineage, and transformation rules.
III. OLAP Cubes: Pre-aggregated, multidimensional views of data for fast querying.
Data Access Layer:
I. Query and Reporting Tools: BI tools for generating reports and dashboards.
II. OLAP Tools: Tools for interactive, multidimensional analysis (slice, dice, drill-down).
III. Data Mining Tools: Tools for discovering patterns and insights.
● OLAP Servers ➖
OLAP (Online Analytical Processing) servers support complex analytical queries and
multidimensional data analysis.
The main types are ROLAP (relational OLAP), MOLAP (multidimensional OLAP), and HOLAP (hybrid OLAP) servers.
● Indexing OLAP Data ➖
1. Bitmap Indexes: ➖
I. Usage: Suitable for low-cardinality columns.
II. Structure: A bit vector is maintained for each distinct value of the indexed column.
III. Advantages: Very fast AND/OR operations and aggregate queries.
IV. Disadvantages: Less efficient for high cardinality columns.
2. B-Tree Indexes: ➖
I. Usage: Suitable for high-cardinality columns.
II. Structure: Balanced tree structure for indexing data.
III. Advantages: Efficient range queries and sorting.
IV. Disadvantages: Slower performance for AND/OR operations compared to bitmap
indexes.
3. Join Indexes: ➖
I. Usage: Pre-computes join operations between fact and dimension tables.
II. Structure: Stores mappings between the rows of joined tables.
SA
III. Advantages: Speeds up join operations, reducing query execution time.
IV. Disadvantages: Increased storage requirements.
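A tiny plain-Python sketch (made-up rows; integers used as bit vectors) illustrating how the bitmap indexes described above can answer an AND query:

# A minimal conceptual sketch of a bitmap index: one bit vector per distinct value;
# AND-ing two vectors answers a multi-condition query quickly (hypothetical data).
rows = [
    {"region": "North", "status": "active"},
    {"region": "South", "status": "inactive"},
    {"region": "North", "status": "inactive"},
    {"region": "North", "status": "active"},
]

def build_bitmap(rows, column):
    index = {}
    for i, row in enumerate(rows):
        index.setdefault(row[column], 0)
        index[row[column]] |= (1 << i)   # set bit i for this value
    return index

region_idx = build_bitmap(rows, "region")
status_idx = build_bitmap(rows, "status")

# Query: region = 'North' AND status = 'active' -> bitwise AND of the two bit vectors.
match = region_idx["North"] & status_idx["active"]
print([i for i in range(len(rows)) if match & (1 << i)])   # [0, 3]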
● OLAP Query Processing ➖
I. Query Parsing: Interpreting the multidimensional query submitted by the user or tool.
II. Query Optimisation: Selecting an efficient execution plan, using indexes and pre-computed aggregates where available.
III. Query Execution: Performing the query operations, such as aggregations, filtering, and
sorting.
IV. Result Presentation: Displaying the query results in a user-friendly format, such as reports
or dashboards.
II. Roll-Up: Aggregating data along a dimension hierarchy (e.g., daily to monthly sales).
III. Drill-Down: Breaking down aggregated data into finer granularity (e.g., monthly to daily
sales).
IV. Slicing: Extracting a single layer of the data cube (e.g., sales for a specific region).
V. Dicing: Creating a sub-cube by selecting specific values for multiple dimensions (e.g.,
sales for a specific region and time period).
II. Materialisation: Pre-computing and storing aggregated values for fast retrieval. Can
be full (all possible aggregates) or partial (only frequently used aggregates).
III. Multidimensional Indexing: Using indexes to speed up access to specific slices or dice
of the data cube.
UNIT ➖ 03
➖
● Data Mining: Definition, Data Mining process, Data mining methodology, Data
mining tasks, Mining various Data types & issues.
Data Mining ➖
● Definition ➖
Data mining is the process of discovering patterns, correlations, trends, and anomalies within
large sets of data through the use of various techniques such as machine learning, statistics,
and database systems. It aims to transform raw data into meaningful information for
decision-making purposes.
● Data Mining Process ➖
1. Business Understanding: ➖
I. Understand the business objectives and requirements.
II. Formulate data mining goals based on the business objectives.
2. Data Understanding: ➖
I. Collect initial data and familiarise yourself with it.
➖
3. Data Preparation: ➖
I. Cleanse the data by handling missing values, outliers, and inconsistencies.
II. Transform data into suitable formats for mining (e.g., normalisation, discretization).
III. Integrate data from multiple sources.
4. Modelling: ➖
I. Select appropriate data mining techniques (e.g., classification, clustering, association).
II. Build and calibrate models based on the prepared data.
III. Evaluate models for accuracy and performance.
5. Evaluation: ➖
I. Assess the models to ensure they meet business objectives.
II. Compare results with expectations and refine the models as needed.
6. Deployment: ➖
I. Implement the data mining results into the decision-making process.
II. Monitor and maintain the models over time to ensure they remain effective.
➖
● Data Mining Methodology
1. Classification: ➖
I. Assigning items to predefined categories or classes (e.g., spam detection in emails).
2. Regression:
I. Predicting a continuous numeric value based on input features (e.g., predicting house
prices).
3. Clustering:
I. Grouping similar items together based on their features (e.g., customer segmentation).
4. Association Rule Mining: ➖
I. Discovering interesting relationships between variables in large databases (e.g., market
basket analysis).
5. Anomaly Detection: ➖
I. Identifying unusual or rare items that do not conform to expected patterns (e.g., fraud
detection).
● Data Mining Tasks ➖
1. Classification ➖
I. Techniques: Decision trees, naive Bayes, support vector machines, k-nearest neighbours.
II. Applications: Spam detection, medical diagnosis, credit scoring.
2. Regression ➖
I. Techniques: Linear regression, polynomial regression, logistic regression.
II. Applications: Predicting prices, demand forecasting, risk assessment.
3. Clustering ➖
I. Techniques: K-means, hierarchical clustering, DBSCAN.
II. Applications: Market segmentation, image segmentation, anomaly detection.
➖
4. Association Rule Mining ➖
I. Techniques: Apriori, FP-Growth, Eclat.
II. Applications: Market basket analysis, cross-selling, recommendation systems.
5. Anomaly Detection ➖
I. Techniques: Isolation forests, one-class SVM, statistical methods.
II. Applications: Fraud detection, network security, fault detection.
➖
● Attribute-Oriented Induction, Association rule mining, Frequent itemset mining,
The Apriori Algorithm, Mining multilevel association rules
● Attribute-Oriented Induction ➖
Attribute-oriented induction is a data mining technique that generalises data by abstracting
low-level data into higher-level concepts. This method is particularly useful for simplifying large
datasets and uncovering meaningful patterns.
Process: ➖
I. Data Collection: Gather relevant data from various sources.
II. Attribute Generalisation: Replace specific values of attributes with more general concepts
based on a concept hierarchy.
III. Generalization Control: Determine the level of abstraction by specifying thresholds or
stopping criteria.
IV. Attribute Reduction: Eliminate irrelevant or less significant attributes to focus on key
aspects of the data.
V. Rule Generation: Derive general rules and patterns from the generalized data.
Applications: ➖
I. Summarising large datasets
II. Knowledge discovery in databases
III. Data reduction and preprocessing for other data mining tasks
● Association Rule Mining & Frequent Itemset Mining ➖
Association rule mining discovers interesting relationships (rules) between items in large transactional datasets; frequent itemset mining finds the itemsets that occur together frequently.
Concepts: ➖
I. Support: The frequency of an itemset in the dataset. Support(X) = (Number of
transactions containing X) / (Total number of transactions)
II. Confidence: The likelihood that the presence of one item leads to the presence of
another. Confidence(X -> Y) = Support(X ∪ Y) / Support(X)
III. Lift: Measures the strength of an association rule compared to the random occurrence of
the items. Lift(X -> Y) = Confidence(X -> Y) / Support(Y)
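A small plain-Python sketch of these three measures (the transaction list is made up for illustration):

# A minimal sketch of the support, confidence, and lift formulas above (hypothetical transactions).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
]

def support(itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    return support(set(x) | set(y)) / support(x)

def lift(x, y):
    return confidence(x, y) / support(y)

print(support({"bread", "butter"}))        # 0.5
print(confidence({"bread"}, {"butter"}))   # about 0.67
print(lift({"bread"}, {"butter"}))         # about 0.89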
Key Steps: ➖
I. Identify frequent itemsets: Find all itemsets with support above a specified threshold.
II. Generate strong association rules: Use the frequent itemsets to generate rules that meet
minimum support and confidence thresholds.
● The Apriori Algorithm ➖
The Apriori algorithm finds frequent itemsets level by level, using the property that every subset of a frequent itemset must also be frequent.
Steps: ➖
I. Generate Candidate Itemsets: Start with single items and iteratively generate larger
itemsets.
II. Prune Infrequent Itemsets: Remove candidate itemsets that do not meet the minimum
support threshold.
III. Repeat: Continue generating and pruning candidate itemsets until no more frequent
itemsets can be found.
IV. Generate Rules: Use the frequent itemsets to generate association rules.
Pseudocode: ➖
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) {
    Ck+1 = candidates generated from Lk;
    for each transaction t in database {
        increment the count of all candidates in Ck+1 that are contained in t;
    }
    Lk+1 = candidates in Ck+1 with min_support;
}
return ∪k Lk;
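A compact, runnable Python version of the same level-wise idea (hypothetical transactions; a count-based minimum support is assumed) might look like this:

# A minimal Python sketch of the Apriori level-wise search described above
# (hypothetical transactions; frozensets keep itemsets hashable).
transactions = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
]
min_support = 2   # minimum number of supporting transactions

def count_support(itemsets):
    return {s: sum(s <= t for t in transactions) for s in itemsets}

# L1: frequent 1-itemsets
items = {frozenset([i]) for t in transactions for i in t}
frequent = [{s for s, c in count_support(items).items() if c >= min_support}]

k = 1
while frequent[-1]:
    prev = frequent[-1]
    # Candidate generation: join frequent k-itemsets to form (k+1)-itemsets.
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Prune candidates whose support is below the threshold.
    counts = count_support(candidates)
    frequent.append({s for s, c in counts.items() if c >= min_support})
    k += 1

all_frequent = set().union(*frequent)
print(sorted(tuple(sorted(s)) for s in all_frequent))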
● Mining Multilevel Association Rules ➖
Multilevel association rules are mined at different levels of a concept hierarchy, from general categories down to specific items.
Process: ➖
I. Define Concept Hierarchies: Establish hierarchies for items (e.g., "electronics" ->
"computers" -> "laptops").
II. Mine Frequent Itemsets at Each Level: Start from the lowest level and mine frequent
itemsets.
III. Generate Rules: Generate association rules at each level of abstraction.
IV. Prune and Refine: Apply thresholds for support and confidence at different levels to refine
the rules.
➖
Example:
I. Low-Level Rule: {HP Laptop, Wireless Mouse} -> {Laptop Bag}
II. High-Level Rule: {Laptop, Mouse} -> {Bag}
Summary ➖
I. Attribute-Oriented Induction: Generalizes data to higher-level concepts, simplifying and
summarising large datasets.
II. Association Rule Mining: Discovers relationships between items using support, confidence, and lift.
III. Frequent Itemset Mining: Identifies itemsets that appear frequently in the data.
IV. The Apriori Algorithm: A key algorithm for mining frequent itemsets and generating
association rules.
V. Mining Multilevel Association Rules: Extracts rules from items at different levels of
abstraction for comprehensive insights.
UNIT ➖ 04
● Overview of classification, Classification process, Decision tree, Decision Tree Induction, Attribute Selection Measures. Overview of classifier’s accuracy, Evaluating classifier’s accuracy, Techniques for accuracy estimation, Increasing the accuracy of classifier
● Overview of Classification ➖
Classification is a supervised learning technique in data mining and machine learning where the
goal is to assign a label or category to new observations based on the training data. It involves
building a model from a set of labelled examples (training data) and then using that model to
classify new examples.
Classification Process
● Data Collection: ➖
Gather and prepare the data for analysis. This dataset includes both the features (attributes)
and the labels (classes).
● Data Preprocessing: ➖
Clean the data by handling missing values, removing outliers, and normalising the data.
Split the dataset into training and testing sets.
● Feature Selection: ➖
Select relevant features that contribute significantly to the prediction process.
● Model Selection: ➖
Choose an appropriate classification algorithm (e.g., decision tree, SVM, k-NN).
● Model Training: ➖
Train the model using the training dataset.
● Model Evaluation: ➖
Evaluate the model using the testing dataset and various performance metrics.
● Model Deployment: ➖
Deploy the model for classifying new, unseen data.
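A minimal scikit-learn sketch of this process (the built-in iris dataset and a decision tree stand in for a real problem):

# A minimal sketch of the classification process with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # data collection
X_train, X_test, y_train, y_test = train_test_split(   # split into training and testing sets
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=3)            # model selection
model.fit(X_train, y_train)                            # model training
predictions = model.predict(X_test)                    # classify unseen data
print("Accuracy:", accuracy_score(y_test, predictions))  # model evaluation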
Decision Tree ➖
A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label.
Decision Tree Induction ➖
Decision tree induction builds the tree by recursively splitting the training data on the attribute that best separates the classes, chosen using an attribute selection measure, until a stopping criterion is met.
● Attribute Selection Measures ➖
Attribute selection measures (also known as splitting criteria) determine how to split the data at
each node in the decision tree. Common measures include:
● Information Gain: ➖
I. Measures the reduction in entropy or uncertainty after splitting the data based on an
attribute.
II. Formula: Information Gain(S, A) = Entropy(S) − Σ over v ∈ Values(A) of (|Sv| / |S|) × Entropy(Sv)
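A short plain-Python sketch of entropy and information gain (the labels and the split below are made up for illustration):

# A minimal sketch of entropy and information gain for one attribute.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, groups):
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

labels = ["yes", "yes", "no", "no", "yes", "no"]
# Suppose attribute A splits the records into these two subsets:
groups = [["yes", "yes", "no"], ["no", "yes", "no"]]
print(information_gain(labels, groups))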
● Techniques for Accuracy Estimation ➖
1. Holdout Method: ➖
I. Split the dataset into a training set and a testing set (e.g., 70% training, 30% testing).
II. Train the model on the training set and evaluate it on the testing set.
2. Cross-Validation: ➖
I. K-Fold Cross-Validation: Divide the dataset into k subsets, train the model k times, each time using a different subset as the testing set and the remaining as the training set.
II. Leave-One-Out Cross-Validation: A special case of k-fold cross-validation where k is equal to the number of samples in the dataset.
3. Bootstrap Method: ➖
I. Randomly sample with replacement from the dataset to create multiple training sets and
evaluate the model on the remaining data.
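A minimal scikit-learn sketch of the holdout and k-fold estimates (iris data and a decision tree used as stand-ins):

# A minimal sketch of holdout and 5-fold cross-validation accuracy estimation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Holdout: 70% training, 30% testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: average accuracy over 5 different train/test splits.
cv_scores = cross_val_score(model, X, y, cv=5)

print("Holdout accuracy:", holdout_accuracy)
print("5-fold CV mean accuracy:", cv_scores.mean())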
● Increasing the Accuracy of Classifier ➖
4. Feature Engineering: ➖
I. Create new features from the existing data that can help improve the model's
performance.
5. Feature Selection: ➖
I. Remove irrelevant or redundant features to reduce noise and improve model
performance.
➖
6. Parameter Tuning:
I. Optimise the hyperparameters of the model using techniques like grid search or random
search.
7. Ensemble Methods: ➖
I. Combine multiple models to improve overall performance (e.g., bagging, boosting,
stacking).
8. Data Augmentation: ➖
I. Increase the size of the training data by creating synthetic samples.
9. Regularisation: ➖
I. Add a penalty to the loss function to prevent overfitting (e.g., L1, L2 regularisation).
➖
● Introduction to Clustering, Types of clusters, Clustering methods, Data
visualisation & various data visualisation tools
● Introduction to Clustering ➖
Clustering is an unsupervised machine learning technique used to group a set of objects in such
a way that objects in the same group (cluster) are more similar to each other than to those in
other groups (clusters). It helps in discovering the inherent structure in data without any
predefined labels.
● Types of Clusters ➖
➖
● Exclusive Clusters (Hard Clustering):
I. Each data point belongs to exactly one cluster.
II. Example: K-Means clustering.
● Hierarchical Clusters: ➖
I. Clusters are organised into a tree-like structure.
II. Example: Agglomerative clustering.
● Density-Based Clusters: ➖
I. Clusters are formed based on the density of data points.
II. Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
➖
● Grid-Based Clusters:
I. The data space is divided into a finite number of cells that form a grid structure.
II. Example: STING (Statistical Information Grid).
● Clustering Methods ➖
● Partitioning Methods: ➖
I. Divide the dataset into a set of k clusters.
II. Examples: K-Means, K-Medoids (a short K-Means sketch follows this list).
● Hierarchical Methods: ➖
I. Build a hierarchy of clusters either through a top-down (divisive) or bottom-up
(agglomerative) approach.
II. Examples: Agglomerative Hierarchical Clustering, DIANA (Divisive Analysis).
● Density-Based Methods: ➖
I. Identify clusters based on the density of data points in the data space.
II. Examples: DBSCAN, OPTICS (Ordering Points To Identify the Clustering Structure).
● Grid-Based Methods: ➖
I. Quantize the data space into a finite number of cells and perform clustering on these
cells.
II. Examples: STING, CLIQUE (Clustering In QUEst).
● Model-Based Methods: ➖
I. Assume a model for each of the clusters and find the best fit of the data to the given
model.
II. Examples: Expectation-Maximization (EM) algorithm, Gaussian Mixture Models (GMM).
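As mentioned under Partitioning Methods, a minimal K-Means sketch with scikit-learn (synthetic 2-D points; k = 3 is an assumption) is shown below:

# A minimal K-Means sketch; make_blobs generates synthetic points standing in for real data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)   # synthetic data with 3 groups

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster assignment for each point
print(kmeans.cluster_centers_)        # the 3 learned centroids
print(labels[:10])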
● Data Visualization ➖
Data visualisation is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualisation tools provide an accessible way to
see and understand trends, outliers, and patterns in data.
● Power BI: ➖
I. A business analytics service by Microsoft.
II. Provides interactive visualisations and business intelligence capabilities with an interface
simple enough for end users to create their own reports and dashboards.
➖
● Matplotlib:
I. A plotting library for the Python programming language.
II. Provides an object-oriented API for embedding plots into applications (a short example follows this list of tools).
● Ggplot2: ➖
I. A data visualisation package for the R programming language.
II. Based on the Grammar of Graphics, providing a powerful model for creating complex
plots.
● D3.js: ➖
I. A JavaScript library for producing dynamic, interactive data visualisations in web
browsers.
II. Uses HTML, SVG, and CSS.
● Plotly: ➖
I. An open-source graphing library that makes interactive, publication-quality graphs online.
II. Supports multiple languages including Python, R, and JavaScript.
● QlikView: ➖
I. A business intelligence tool for data visualisation.
II. Allows users to create guided analytics applications and dashboards.
● Google Data Studio: ➖
I. A free tool by Google to create interactive dashboards and reports.
II. Easily integrates with other Google services like Google Analytics and Google Sheets.
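The Matplotlib example referred to above, using made-up monthly sales figures:

# A minimal Matplotlib sketch: a bar chart and a histogram of hypothetical sales data.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(months, sales)
ax1.set_title("Monthly sales")
ax2.hist(sales, bins=4)
ax2.set_title("Distribution of sales")
plt.tight_layout()
plt.show()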
Summary ➖
I. Clustering: An unsupervised technique for grouping similar data points into clusters.
II. Types of Clusters: Exclusive (hard), hierarchical, density-based, and grid-based clusters.
III. Clustering Methods: Partitioning, hierarchical, density-based, grid-based, and model-based methods.
IV. Data Visualization: The graphical representation of data to understand trends and
patterns.
V. Visualisation Tools: Include Tableau, Power BI, Matplotlib, ggplot2, D3.js, Plotly, QlikView,
and Google Data Studio.
DWM (5th Sem) May 2018 Question Paper
SECTION A
1. Briefly answer the following:
a) Differentiate between operational and informational data stores.
b) What is multidimensional data? Give two examples.
c) What is OLAM?
d) Define Data Mining.
e) Briefly discuss the Snowflake schema.
f) Discuss Discovery driven cube.
g) What is a Decision Tree?
h) How is the accuracy of a classifier measured?
i) What are the different types of data used in cluster analysis?
j) What are the parameters for selecting and using the right data mining
technique?
➖
b) What is multidimensional data? Give two examples.
1. Multidimensional Data:
I. Data structured in multiple dimensions, often used in OLAP systems for complex queries
and analysis.
II. Each dimension represents a different aspect of the data.
Examples:
I. Sales data: Dimensions could be time, geography, and product.
II. Financial data: Dimensions could be time, department, and expense type.
➖
c) What is OLAM?
1. OLAM (Online Analytical Mining):
I. Combines Online Analytical Processing (OLAP) with data mining techniques.
II. Allows for the dynamic integration of OLAP with data mining, providing real-time analytical
insights.
➖
d) Define Data Mining.
1. Data Mining:
The process of discovering patterns, correlations, and insights from large datasets using
techniques from statistics, machine learning, and database systems.
Aims to transform raw data into useful information.
e) Briefly discuss the Snowflake schema.
1. Snowflake Schema: ➖
A type of data warehouse schema where the dimension tables are normalised, splitting data into
additional tables.
It reduces data redundancy but increases the complexity of queries due to the additional joins.
Example: A dimension table for location split into separate tables for country, state, and city.
➖
➖
f) Discuss Discovery driven cube.
1. Discovery-Driven Cube:
I. A data cube that supports data exploration by automatically highlighting interesting and
significant data points.
II. Integrates visual and interactive analysis to help users discover insights without
predefined queries.
III. Facilitates hypothesis generation and validation by dynamically adjusting views based on
data interactions.
➖
g) What is a Decision Tree?
1. Decision Tree:
A flowchart-like structure used for classification and regression tasks.
Each internal node represents a decision based on an attribute, each branch represents an
outcome of the decision, and each leaf node represents a class label or continuous value.
Example: A decision tree for weather prediction where nodes represent conditions like
temperature, humidity, and wind.
h) How is the accuracy of a classifier measured?
Accuracy of a Classifier:
Measured by the proportion of correctly predicted instances to the total instances.
Formula: Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
Other metrics: Precision, Recall, F1 Score, ROC-AUC.
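A small Python sketch of this formula (hypothetical predicted vs. actual labels):

# A minimal sketch of the accuracy formula.
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(manual)                           # 5 correct out of 6, about 0.83
print(accuracy_score(y_true, y_pred))   # same value via scikit-learn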
j) What are the parameters for selecting and using the right data mining technique?
➖
Parameters for Selecting Data Mining Technique:
I. Nature of Data: Type, structure, and quality of data (numeric, categorical, binary, etc.).
II. Goal of Analysis: Classification, clustering, regression, association, etc.
III. Data Volume: Size of the dataset and scalability requirements.
IV. Model Interpretability: Need for understandable models vs. black-box models.
V. Accuracy Requirements: Desired level of prediction accuracy.
VI. Computational Resources: Availability of processing power and memory.
VII. Time Constraints: Time available for model training and prediction.
VIII. Tool and Algorithm Availability: Access to specific data mining tools and algorithms.
SECTION B
2. Define Data Warehousing. What is the need for data warehousing? Discuss the
structure of a data warehouse.
3. What do you mean by data pre-processing? Explain the various stages in the process
of data pre-processing.
5. Explain Association Rule Mining. What are the various algorithms for generating
association rules? Discuss with examples.
Answers ➖
SECTION B
2. Define Data Warehousing. What is the need for data warehousing? Discuss the structure of a data
warehouse.
Data Warehousing: A data warehouse is a centralized repository that stores integrated data from
multiple heterogeneous sources. It is designed to facilitate reporting, analysis, and
decision-making processes by providing a consolidated view of an organization's data.
Need for Data Warehousing:
● Integration of Data: Combines data from different sources into a unified format, making it
easier to analyze and report.
● Historical Analysis: Maintains historical data for trend analysis and long-term business
strategy planning.
● Improved Query Performance: Optimizes query performance through indexing,
summarization, and denormalization.
● Decision Support: Enhances decision-making by providing accurate, consistent, and
comprehensive data.
● Data Quality and Consistency: Ensures data quality, accuracy, and consistency across
the organization.
● Support for OLAP and Data Mining: Facilitates complex queries and data mining
operations.
Structure of a Data Warehouse:
1. Data Sources:
○ Operational databases, external data sources, flat files, etc.
○ ETL (Extract, Transform, Load) process extracts data from these sources.
2. ETL Process:
○ Extraction: Data is extracted from various sources.
○ Transformation: Data is cleaned, transformed, and standardized.
○ Loading: Transformed data is loaded into the data warehouse.
3. Data Staging Area:
○ Temporary storage area where data is processed and transformed before loading
into the data warehouse.
4. Data Storage:
○ Centralized Repository: Stores integrated and transformed data.
○ Data Marts: Subsets of the data warehouse tailored for specific departments or
business areas.
○ Metadata Repository: Stores information about data sources, data
transformations, data schemas, etc.
5. Data Access Layer:
○ Provides tools and interfaces for querying, reporting, and analyzing the data.
○ Includes OLAP tools, data mining tools, and business intelligence tools.
6. Front-End Tools:
○ Dashboards, reporting tools, and data visualization tools used by end-users to
interact with the data warehouse.
3. What do you mean by data pre-processing? Explain the various stages in the process
of data pre-processing.
Data Pre-Processing: Data pre-processing is the process of transforming raw data into a clean
and usable format, preparing it for further analysis or modeling. It is a crucial step to ensure data
quality and improve the performance of data mining algorithms.
Stages in Data Pre-Processing:
1. Data Cleaning: Handle missing values, smooth noisy data, remove outliers, and correct inconsistencies.
2. Data Integration: Combine data from multiple sources into a coherent data set.
3. Data Transformation: Normalise, aggregate, and discretise the data so that it is suitable for mining.
4. Data Reduction: Reduce the data volume through techniques such as PCA, sampling, and aggregation.
4. What is OLAP? Discuss the architecture of OLAP.
OLAP (Online Analytical Processing): OLAP is a category of software tools that provides
analysis of data stored in a database. It enables users to perform complex queries and analysis,
often involving large amounts of data, with the purpose of discovering insights and making
business decisions.
Architecture of OLAP:
1. OLAP Cube:
○ A multi-dimensional array of data, which allows data to be modeled and viewed in
multiple dimensions.
2. OLAP Servers:
○ ROLAP (Relational OLAP): Stores data in relational databases and uses complex
SQL queries for data analysis.
○ MOLAP (Multidimensional OLAP): Stores data in multi-dimensional cube
structures and pre-computes aggregates for fast querying.
○ HOLAP (Hybrid OLAP): Combines ROLAP and MOLAP, storing detailed data in relational databases and aggregates in multi-dimensional cubes.
3. Data Sources:
○ Includes operational databases, data warehouses, and external data sources from which data is extracted.
4. ETL Process:
○ Extracts data from various sources, transforms it into a suitable format, and loads it
into OLAP servers.
5. OLAP Engine:
○ The core processing engine that handles complex queries, performs computations,
and retrieves data from OLAP cubes or relational databases.
6. OLAP Tools:
○ Query and Reporting Tools: Allow users to create and execute queries, generate
reports, and visualize data.
○ Analysis Tools: Provide functionalities for drill-down, roll-up, slicing, dicing, and
pivoting.
7. User Interface:
○ The front-end interface through which users interact with the OLAP system, often
providing graphical and interactive capabilities.
5. Explain Association Rule Mining. What are the various algorithms for generating
association rules? Discuss with examples.
Association Rule Mining: Association rule mining is a technique in data mining that discovers
interesting relationships, patterns, and associations among a set of items in large datasets,
typically transactional databases.
Key Concepts:
● Support: The frequency with which an itemset appears in the dataset.
● Confidence: The likelihood that a transaction containing item X also contains item Y.
● Lift: The strength of a rule relative to the random co-occurrence of the items.
Algorithms for Generating Association Rules:
1. Apriori Algorithm:
○ Generates candidate itemsets and prunes infrequent ones using support threshold.
○ Example: In a supermarket dataset, finding that customers who buy bread also buy
butter with a certain support and confidence.
2. FP-Growth (Frequent Pattern Growth):
○ Uses a divide-and-conquer strategy and constructs an FP-tree to find frequent
itemsets without candidate generation.
○ Example: Identifying frequently co-purchased items in an e-commerce platform
using a compact FP-tree structure.
3. Eclat Algorithm:
○ Uses a depth-first search approach and vertical data format (transaction IDs) for
discovering frequent itemsets.
○ Example: Mining frequent itemsets in text data where items are words and
transactions are documents.
6. Discuss Bootstrapping, Boosting and Bagging with examples.
Bootstrapping:
● A resampling technique that creates multiple training sets by sampling from the original dataset with replacement.
● Example: Estimating a classifier's accuracy by repeatedly training and testing it on different bootstrap samples.
Boosting:
● An ensemble method that trains models sequentially, with each new model giving more weight to the examples the previous models misclassified, and combines them into a weighted prediction.
● Example: AdaBoost, which increases the weights of misclassified instances in each round.
Bagging:
● An ensemble method that trains multiple models on different bootstrap samples and aggregates their predictions.
● Example: Random Forest, which builds multiple decision trees on different subsets of the data and averages their predictions.
7. Explain the various clustering algorithms with examples.
1. K-Means Clustering:
○ Partitions the data into K clusters by minimizing the sum of squared distances
between data points and the cluster centroids.
○ Example: Segmenting customers based on purchasing behavior.
2. Hierarchical Clustering:
○ Builds a hierarchy of clusters either through a bottom-up (agglomerative) or
top-down (divisive) approach.
○ Example: Creating a dendrogram to visualize the hierarchical relationships
between clusters.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
○ Forms clusters based on the density of data points, identifying core points, border
points, and noise.
○ Example: Identifying clusters in geographical data where clusters are regions of
high data density.
4. Gaussian Mixture Models (GMM):
○ Assumes that the data is generated from a mixture of several Gaussian
distributions and uses the EM algorithm to estimate the parameters.
○ Example: Clustering data points that follow a normal distribution into multiple
Gaussian clusters.
5. Agglomerative Hierarchical Clustering:
○ Recursively merges the closest pair of clusters until all points are in a single cluster.
○ Example: Clustering gene expression data to find groups of genes with similar expression patterns.
DWM (5th Sem) May 2019 Question Paper
SECTION-A
Q1. Answer briefly :
a) What are operational and informational data stores?
b) What is OLAP data warehouse?
c) Write about the importance of Artificial Intelligence.
d) What are data cubes?
e) What is data summarization?
f) How does decision tree work?
g) Write about market based analysis.
h) What is the accuracy of classifier?
i) What is cross validation?
j) Why do we use data visualization?
Answers ➖
SECTION A
d) What are data cubes?
● Data Cubes:
○ Multidimensional arrays of data used in OLAP systems.
○ Organize data into dimensions (e.g., time, location, product) and measures (e.g.,
sales, profit).
○ Facilitate complex queries and data analysis by providing a structured and intuitive
way to navigate and summarize large datasets.
f) How does decision tree work?
● Decision Tree:
○ A flowchart-like structure used for classification and regression tasks.
○ Nodes: Represent decisions or tests based on an attribute.
○ Branches: Represent outcomes of the decision or test.
○ Leaf Nodes: Represent class labels or continuous values.
○ The tree is built by recursively splitting the dataset based on attribute values that
maximize the separation of classes (using measures like Gini impurity, entropy, or
information gain).
g) Write about market based analysis.
● Market-Based Analysis:
○ Also known as market basket analysis, it is a data mining technique used to
understand the purchase behavior of customers.
○ Identifies patterns and associations between items purchased together.
○ Example: Discovering that customers who buy bread also frequently buy butter.
○ Utilized in retail for cross-selling, product placement, and inventory management.
h) What is the accuracy of classifier?
● Accuracy of a Classifier:
○ The proportion of correctly predicted instances to the total instances.
○ Formula: Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
○ Accuracy is a measure of the classifier's performance, indicating how well it
predicts the correct labels.
i) What is cross validation?
● Cross Validation:
○ A technique for evaluating the performance of a model by partitioning the data into
training and testing sets multiple times.
○ K-Fold Cross Validation: The dataset is divided into k subsets (folds). The model
is trained on k-1 folds and tested on the remaining fold. This process is repeated k
times, with each fold used as the test set once.
H
○ Helps in assessing the model's generalizability and reducing overfitting.
j) Why do we use data visualization?
● Data Visualization:
○ Transforms data into graphical representations, such as charts, graphs, and maps.
○ Importance:
■ Insight Discovery: Reveals patterns, trends, and correlations that may not
be apparent in raw data.
■ Communication: Simplifies complex data, making it easier to understand
and communicate findings.
■ Decision Making: Supports data-driven decision making by presenting
information clearly and intuitively.
■ Exploration: Allows interactive exploration of data, enabling users to drill
down into details and uncover insights.
SECTION-B
Q2 How much does a data warehouse cost? Write their applications and uses.
Q3 Discuss the steps of building data warehouse by considering various technical
aspects.
Q4 What is multidimensional data model? Discuss the schemas for multidimensional
data.
Q5 What is association rule mining? Explain Apriori algorithm in data mining.
Q6 Define clustering? Why clustering is important in Data Mining? Write its uses.
Q7 What are different types of Data Mining Techniques? Explain any one in detail?.
Answers ➖
SECTION B
Q2. How much does a data warehouse cost? Write their applications and uses.
● The cost of a data warehouse can vary significantly based on several factors:
○ Infrastructure: Hardware costs for servers, storage devices, and networking equipment.
○ Software: Costs for database management systems (DBMS), ETL tools, and
analytics software.
○ Development: Costs associated with data modeling, ETL development, and
integration.
○ Maintenance: Ongoing costs for data cleansing, monitoring, and support.
Applications and Uses:
● Business Intelligence: Provides a consolidated view of data for reporting and analytics.
● Decision Support: Supports strategic decision-making based on historical and real-time
data.
● Predictive Analytics: Enables forecasting and predictive modeling using advanced
analytics techniques.
● Operational Efficiency: Improves efficiency by streamlining data access and analysis.
● Customer Insights: Helps in understanding customer behavior, preferences, and trends.
● Regulatory Compliance: Facilitates compliance reporting and audit trails.
● Risk Management: Identifies and mitigates risks through data-driven insights.
● Marketing and Sales: Enhances targeting, campaign effectiveness, and sales
forecasting.
Q3. Discuss the steps of building a data warehouse by considering various technical aspects.
1. Requirements Analysis:
○ Gather business requirements and identify the source systems and data needed.
2. Data Modeling:
○ Design the conceptual, logical, and physical models (e.g., star or snowflake schema).
3. Infrastructure Setup:
○ Select the hardware, DBMS, and ETL tools.
4. ETL Development:
○ Extract data from source systems and transform it into the warehouse format.
○ Load transformed data into the data warehouse.
5. Data Storage and Organization:
○ Choose appropriate storage structures (tables, indexes, partitions).
○ Implement data partitioning, clustering, and indexing for performance optimization.
6. Metadata Management:
○ Develop and maintain metadata repository.
○ Document data lineage, transformations, and data definitions.
7. OLAP Cube Design:
○ Design and build OLAP cubes for multidimensional analysis.
○ Define dimensions, measures, hierarchies, and aggregation levels.
8. Implementation and Testing:
○ Deploy data warehouse components (database, ETL processes, OLAP cubes).
○ Conduct unit testing, integration testing, and performance testing.
9. Deployment and Maintenance:
○ Deploy the data warehouse to production environment.
○ Establish monitoring and maintenance processes for data quality, performance
tuning, and backup/recovery.
Q4. What is a multidimensional data model? Discuss the schemas for multidimensional data.
A multidimensional data model organises data into facts (numeric measures) and dimensions, typically viewed as a data cube, to support OLAP-style analysis. The main schemas for multidimensional data are:
1. Star Schema:
○ Central fact table surrounded by dimension tables.
○ Simple and denormalized structure, suitable for querying and reporting.
○ Example: Sales fact table linked to product, time, and location dimensions.
2. Snowflake Schema:
○ Extends the star schema by normalizing dimension tables.
○ Reduces data redundancy but increases complexity due to additional joins.
○ Example: Product dimension further normalized into product category and
subcategory tables.
3. Fact Constellation (Galaxy) Schema:
○ Multiple fact tables share dimension tables.
○ Suitable for complex business processes with multiple interrelated fact tables.
○ Example: Separate fact tables for sales and inventory linked to common product
and time dimensions.
Q5. What is association rule mining? Explain the Apriori algorithm in data mining.
Association Rule Mining: A data mining technique that discovers interesting relationships, patterns, and associations among sets of items in large transactional databases.
Apriori Algorithm:
● Algorithm Steps:
○ Generate Candidate Itemsets: Identify frequent items (single items) and combine
them to form candidate itemsets.
○ Calculate Support: Count the occurrences of each candidate itemset in the
dataset.
○ Prune Non-Frequent Itemsets: Remove candidate itemsets that do not meet the
minimum support threshold.
○ Generate Association Rules: From frequent itemsets, generate rules with a
minimum confidence threshold.
● Example: In a supermarket dataset:
○ Support: Percentage of transactions containing a specific itemset.
○ Confidence: Likelihood of one item being purchased given the purchase of another
item.
Q6. Define clustering? Why clustering is important in Data Mining? Write its uses.
Clustering: An unsupervised learning technique that groups objects so that objects in the same cluster are more similar to one another than to objects in other clusters.
Importance: Clustering reveals the natural structure of data without predefined labels, supports data summarisation, and is often a preprocessing step for other data mining tasks.
Uses: Customer segmentation, market research, image segmentation, document grouping, and anomaly detection.
Q7. What are different types of Data Mining Techniques? Explain any one in detail.
1. Classification: Assigns data points to predefined categories or classes based on labelled training data.
2. Regression: Predicts continuous values or numeric outcomes.
3. Clustering: Groups similar data points into clusters based on similarity.
4. Association Rule Mining: Finds interesting relationships between variables in large
datasets.
5. Anomaly Detection: Identifies outliers or unusual patterns in data.
6. Sequential Pattern Mining: Discovers sequential patterns or trends in data sequences.
7. Text Mining: Extracts meaningful patterns and relationships from unstructured text data.
Explanation - Classification:
● Definition: Classification is a supervised learning technique used to predict categorical
labels or class memberships for new data points based on past observations.
● Process:
○ Training Phase: Learn patterns and relationships from labeled training data using
algorithms like Decision Trees, Naive Bayes, or Support Vector Machines.
○ Prediction Phase: Apply the learned model to new data to predict class labels.
● Example:
○ Application: Email Spam Detection.
○ Process: Train a classifier using labelled emails (spam or not spam). The classifier
learns patterns (e.g., keywords, sender) associated with spam emails. When a new
email arrives, the classifier predicts whether it is spam or not based on learned
patterns.
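A minimal scikit-learn sketch of this spam-detection example (the tiny labelled email set below is made up for illustration):

# A minimal sketch of the spam-detection classification example.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win a free prize now", "meeting at 10 am tomorrow",
          "free offer click now", "project report attached"]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)      # learn word patterns (keywords) from labelled emails

classifier = MultinomialNB()
classifier.fit(X, labels)                 # training phase

new_email = ["free prize offer"]
print(classifier.predict(vectorizer.transform(new_email)))   # prediction phase; should print ['spam']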