1.Definition of a Data Warehouse: A data warehouse is a centralized repository that
stores large volumes of structured and semi-structured data from various sources. It
is designed for query and analysis rather than transaction processing. Data is
consolidated, transformed, and stored to support decision-making processes.
2.Need for a Separate Data Warehouse
Data Integration: Combines data from disparate sources, providing a unified view.
Historical Analysis: Stores historical data, enabling trend analysis over time.
Performance: Optimizes query performance for complex analytical queries.
Data Quality: Ensures consistency, accuracy, and reliability through data cleansing.
Decision Support: Facilitates business intelligence activities like reporting, data
mining, and OLAP (Online Analytical Processing).
3.Enterprise Data Warehouse (EDW): An Enterprise Data Warehouse is a
centralised repository that stores integrated data from multiple business sources. It
provides a comprehensive and consistent view of the enterprise's data, supporting
decision-making at all levels of the organisation.
4.Enterprise Data Warehouse
Key Characteristics:
Centralization: All data is stored in a single, central repository.
Integration: Data from various sources is cleaned, transformed, and integrated to
ensure consistency.
Scalability: Designed to handle large volumes of data and a wide variety of data types.
Complexity: Often complex to design and maintain due to the need for
comprehensive data integration and storage solutions.
Cost: Typically more expensive to implement and maintain compared to other
models due to the scale and complexity.
5.Data Mart: A Data Mart is a subset of a data warehouse, typically focused on a
specific business line or team. It is designed to meet the needs of a particular group
of users, such as the sales department, finance team, or marketing department.
6.Data Mart Key Characteristics:
Focused Scope: Limited to specific business areas or departments.
Simpler Design: Easier to implement and maintain compared to an EDW.
Faster Access: Designed to provide quicker access to relevant data for the targeted user
group.
Cost-Effective: Less expensive to implement due to the reduced scope and complexity.
Data Mart Types:
Dependent Data Mart: Extracted from an existing EDW, ensuring consistency with the central data repository.
Independent Data Mart: Built directly from source systems, independent of an EDW.
7.Virtual Warehouse: A Virtual Warehouse is a logical data warehouse that
provides a unified view of data without physically consolidating it into a single
repository. It uses virtualization technologies to create a layer that allows users to
query and analyze data from multiple sources as if it were stored in one place.
8.Virtual Warehouse Key Characteristics:
Logical Integration: Data remains in its original source systems but is accessed
and integrated virtually.
Flexibility: Easier to adapt to changes in the data environment.
Lower Cost: Reduced need for physical storage and data movement.
Performance: Can be affected by the performance of the underlying source systems
and the virtualization layer.
9.Differences between Operational Database Systems and Data Warehouses
Operational Database:
i. Operational systems are designed to support high-volume transaction processing.
ii. Operational systems are usually concerned with current data.
iii. Data within operational systems is updated regularly according to need (volatile).
iv. They are designed for real-time business dealings and processes.
v. They support thousands of concurrent clients.
vi. A query typically accesses a small number of records.
Data Warehouse:
i. Data warehousing systems are typically designed to support high-volume analytical processing.
ii. Data warehousing systems are usually concerned with historical data.
iii. Non-volatile: new data may be added regularly, but existing data is rarely updated.
iv. They are designed for analysis of business measures by subject area, categories, and attributes.
v. They support only a few concurrent clients relative to OLTP.
vi. A query typically accesses a large number of records.
10.Data Cube: A Data Cube is a multi-dimensional array of values used to
represent data in a data warehouse. It allows data to be modelled and viewed in
multiple dimensions, which are often referred to as the attributes or features of the
data.
11.Key Characteristics of Data Cube:
Dimensions: Represent the perspectives or entities with respect to which an
organisation wants to keep records (e.g., time, geography, products).
Facts: Numeric data points of interest, usually aggregatable measures such as sales.
Hierarchies: Each dimension can have hierarchical levels (e.g., day → month → year for time).
Dice: A sub-cube created by selecting specific values for multiple dimensions.
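As an added illustration (not part of the original notes), the sketch below builds a tiny two-dimensional cube with pandas: year and region are the dimensions, sales is the measure, and the pivot table holds the aggregated facts. All values and column names are made up.

```python
import pandas as pd

# Hypothetical fact records: one row per sale, two dimensions (year, region), one measure (sales).
sales = pd.DataFrame({
    "year":   [2023, 2023, 2023, 2024, 2024, 2024],
    "region": ["East", "West", "East", "West", "East", "West"],
    "sales":  [100, 150, 120, 130, 170, 160],
})

# A two-dimensional "cube" view: dimensions on the axes, the measure aggregated in the cells.
cube = sales.pivot_table(values="sales", index="year", columns="region", aggfunc="sum")
print(cube)

# Dice: restrict both dimensions to chosen values, yielding a sub-cube.
print(cube.loc[[2023], ["East"]])
```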
12.Conceptual Modeling of Data Warehouse: Conceptual modeling of a data
warehouse involves creating a high-level blueprint that outlines the structure and
organization of the data warehouse. It focuses on defining the main entities,
relationships, and data flows without worrying about the technical implementation
details.
13.Key Techniques of Conceptual Modeling of Data Warehouse:
Star Schema: The most common data warehouse schema. It consists of a central
fact table surrounded by dimension tables. Each dimension table contains attributes
related to the dimensions.
Snowflake Schema: A more normalized form of the star schema where dimension
tables are further normalized into multiple related tables.
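As a small illustration (assuming pandas; every table and column name below is hypothetical), a star schema can be mimicked with one fact table and two dimension tables, which an analytical query joins on their keys and then aggregates:

```python
import pandas as pd

# Hypothetical star schema: a central fact table plus two dimension tables.
fact_sales = pd.DataFrame({
    "date_id":    [1, 1, 2],
    "product_id": [10, 11, 10],
    "amount":     [250, 400, 300],          # the measure stored in the fact table
})
dim_date = pd.DataFrame({"date_id": [1, 2], "month": ["Jan", "Feb"], "year": [2024, 2024]})
dim_product = pd.DataFrame({"product_id": [10, 11], "category": ["Books", "Toys"]})

# Analytical queries join the fact table to the surrounding dimension tables on their keys.
report = (fact_sales
          .merge(dim_date, on="date_id")
          .merge(dim_product, on="product_id")
          .groupby(["year", "month", "category"])["amount"].sum())
print(report)
```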
14.Concept Hierarchies: Concept hierarchies are a way to organize data into multiple levels of granularity, allowing users to navigate through different layers of data abstraction. They define a sequence of mappings from a set of low-level concepts to higher-level, more general concepts (e.g., city → state → country for a location dimension).
15.Measures: Their Categorization and Computation
Measures: Measures are the quantitative data points in a data warehouse,
representing the metrics that users want to analyze.
Categorization of Measures:
Distributive Measures: Can be computed by partitioning the data, aggregating each partition, and combining the partial results; the outcome is the same regardless of how the data is split. Example: SUM, COUNT.
Algebraic Measures: Can be computed from a fixed number of distributive measures. Example: AVERAGE (computed as SUM/COUNT), RATIO.
Holistic Measures: Require the entire dataset to compute the result. Example: MEDIAN, MODE.
Computation of Measures: Measures are typically computed using aggregate functions such as SUM, COUNT, AVERAGE, MIN, and MAX. They can also be derived using more complex formulas that combine basic measures.
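A minimal sketch of the three categories, assuming plain Python and a made-up sales list split into two partitions to mimic distributed aggregation:

```python
import statistics

sales = [120, 90, 150, 90, 200]
partitions = [sales[:2], sales[2:]]          # pretend the data lives on two separate nodes

# Distributive: partial results from each partition can be combined directly.
total = sum(sum(p) for p in partitions)      # SUM
count = sum(len(p) for p in partitions)      # COUNT

# Algebraic: computed from a fixed number of distributive measures.
average = total / count                      # AVERAGE = SUM / COUNT

# Holistic: needs the whole dataset at once, not just partition summaries.
median = statistics.median(sales)            # MEDIAN

print(total, count, average, median)
```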
18.OLEP (Online Evolutionary Processing): While OLAP focuses on the analytical
processing of historical data, OLEP refers to a paradigm often associated with the
processing of evolving or real-time data, supporting continuous and adaptive
queries. This is not as commonly referenced as OLAP, but it generally deals with data
that changes over time and requires immediate, adaptive responses.
*What is Metadata: Metadata is data that provides information about other data, such as
details that describe the content, quality, structure, and management of a dataset. For
example, metadata for a digital photo might include the date it was taken, the camera
settings, and the location.
16.OLAP Operations: OLAP operations enable users to analyze multi-dimensional
data interactively, allowing for insights from different perspectives and granularities.
These operations are typically performed on a multi-dimensional data model or a
data cube. Here are the key OLAP operations:
Roll-Up: Aggregates data by climbing up a concept hierarchy or by reducing
dimensions. Example: Rolling up sales data from the day level to the month level.
Drill-Down: Breaks down data into finer levels of detail, the opposite of roll-up.
Example: Drilling down from the year level to the quarter level in sales data.
Slice: Selects a single value for one dimension of the cube, producing a sub-cube with one fewer dimension. Example: Slicing data to view sales figures for a specific year only.
Dice: Selects two or more dimensions to create a sub-cube, providing a more
focused dataset. Example: Dicing data to view sales figures for specific products
and regions.
Pivot (Rotate): Reorients the data cube, allowing data to be viewed from different perspectives. Example: Rotating the cube to swap rows and columns in a report to get a different view.
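The sketch below (added as an illustration, assuming pandas and a made-up sales table) approximates each of these OLAP operations with ordinary DataFrame operations:

```python
import pandas as pd

# Hypothetical sales records with a time hierarchy (year > month) and two other dimensions.
df = pd.DataFrame({
    "year":    [2024, 2024, 2024, 2024],
    "month":   [1, 1, 2, 2],
    "region":  ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "sales":   [100, 80, 120, 90],
})

rollup    = df.groupby("year")["sales"].sum()                    # Roll-up: month level -> year level
drilldown = df.groupby(["year", "month"])["sales"].sum()         # Drill-down: back to finer detail
slice_    = df[df["region"] == "East"]                           # Slice: fix one value of one dimension
dice      = df[(df["region"] == "East") & (df["month"] == 1)]    # Dice: restrict two or more dimensions
pivot     = df.pivot_table(values="sales", index="month",        # Pivot: reorient rows/columns
                           columns="region", aggfunc="sum")

print(rollup, drilldown, slice_, dice, pivot, sep="\n\n")
```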
17.Operations in the Multidimensional Data Model (OLAP): The multi-dimensional
data model is foundational for OLAP operations, enabling sophisticated data analysis
through various operations:
Aggregation: Summarizes data along one or more dimensions. Example: Summing up sales figures for each product category.
Navigation: Involves moving through different levels of data detail, including drill-down and roll-up operations. Example: Navigating from yearly sales data to monthly sales data.
Selection: Filters data to focus on specific criteria. Example: Selecting sales data for a particular region or time period.
Computation: Performs calculations on data, such as averages, ratios, and percentages. Example: Calculating the average sales per customer.
18.Data Warehouse Design and Usage
Design: Requirement gathering, data modeling (conceptual, logical, physical), ETL process design, metadata management, and architecture planning (centralized, federated, hybrid).
Usage: Supports complex querying and data analysis for informed decision-making. Tools: OLAP operations (roll-up, drill-down, slice, dice, pivot).
19.From Online Analytical Processing to Multidimensional Data Mining
Online Analytical Processing (OLAP): Purpose: Facilitates complex queries and
analysis of data in a multidimensional format.
Operations: Roll-up, drill-down, slice, dice, and pivot.
Tools: OLAP servers (ROLAP, MOLAP, HOLAP) and cube structures.
Multidimensional Data Mining: Purpose: Discover patterns, correlations, and
anomalies in large datasets.
Techniques: Association rule mining, classification, clustering, regression, outlier
detection.
Integration with OLAP: Applies data mining techniques to multidimensional data
stored in OLAP cubes, enhancing analysis with trend analysis, forecasting, and
predictive modeling.
20.Data Warehouse Implementation
Implementation Steps:
Planning and Analysis: Objective: Define project scope, objectives, timelines, and conduct
feasibility studies. Activities: Risk assessment and assembling a project team.
Design: Objective: Create detailed data models and ETL processes.
Activities: Develop conceptual, logical, and physical schemas; plan for data quality, security,
and governance.
Development: Objective: Build the data warehouse infrastructure.
Activities: Set up servers and storage, implement ETL processes, develop the data warehouse database and metadata repository.
Testing: Objective: Ensure the system works correctly.
Activities: Perform unit testing, integration testing, validate data accuracy and consistency, conduct user acceptance testing (UAT).
Deployment: Objective: Make the data warehouse operational.
Activities: Migrate data, set up access controls, train end-users, roll out the system.
Maintenance and Support: Objective: Keep the data warehouse running smoothly.
Activities: Monitor performance, update data, address issues, plan for scalability and
system upgrades.
21.What is Data Mining: It is the process of discovering patterns, trends, correlations,
and anomalies within large datasets using techniques from statistics, machine learning, and
database systems. The goal is to extract valuable information from raw data and transform it
into an understandable structure for further use, such as decision-making, prediction, and
knowledge discovery.
22.Process of Knowledge Discovery in Databases (KDD)
The process of Knowledge Discovery in Databases (KDD) is a comprehensive
process of converting raw data into useful information and knowledge. It consists of
several steps:
1. Data Cleaning:
○ Purpose: Remove noise and correct inconsistencies in the data.
○ Activities: Handling missing values, correcting errors, and smoothing
noisy data.
2. Data Integration:
○ Purpose: Combine data from multiple sources into a coherent dataset.
○ Activities: Merging databases, data warehouses, or different data
formats.
3. Data Selection:
○ Purpose: Select relevant data for analysis.
○ Activities: Choosing a subset of attributes or records from the dataset.
4. Data Transformation:
○ Purpose: Transform data into suitable formats for mining.
○ Activities: Normalization, aggregation, generalization, and feature
extraction.
5. Data Mining:
○ Purpose: Apply algorithms to extract patterns from the data.
○ Activities: Using techniques such as classification, regression,
clustering, association, etc.
23.Example of KDD Process
Consider an e-commerce company that wants to understand customer purchasing
behaviors to improve marketing strategies:
1. Data Cleaning: Remove duplicate entries, correct data entry errors, handle
missing values in transaction records.
2. Data Integration: Combine customer data from CRM systems, web analytics,
and transaction databases into a single dataset.
3. Data Selection: Select relevant attributes such as customer demographics,
purchase history, and browsing behavior.
28.Association Rule Learning: Association rule learning is a method used in data
mining to discover interesting relationships, patterns, or associations among a set of
items in large datasets. It aims to identify rules that predict the occurrence of an item
based on the occurrences of other items.
24.Types of Repositories
1. Data Warehouses:
● Description: Centralized repositories that store integrated data from multiple
sources, designed for query and analysis.
● Characteristics: Structured, subject-oriented, time-variant, and non-volatile.
● Use Case: Business intelligence and reporting, historical data analysis.
2. Databases:
● Description: Structured collections of data, organized in tables and managed by
database management systems (DBMS).
● Characteristics: Organized into schemas, supports transactions, ensures data
integrity.
● Use Case: Online transaction processing (OLTP), data storage for applications.
3. Data Lakes:
● Description: Storage repositories that hold large amounts of raw data in its native
format until it is needed.
● Characteristics: Highly scalable, supports structured and unstructured data,
schema-on-read.
● Use Case: Big data analytics, storing unstructured data, data exploration.
25.Data Mining Tasks
1. Descriptive Tasks:
● Clustering: Grouping similar data objects into clusters based on their characteristics.
○ Example: Market segmentation to identify distinct customer groups.
● Association Rule Mining: Discovering interesting relationships between variables in
large datasets.
○ Example: Market basket analysis to find products frequently bought together.
● Summarization: Providing a compact description of a dataset.
○ Example: Generating a summary report of sales data.
2. Predictive Tasks:
● Classification: Assigning items to predefined categories or classes.
○ Example: Email spam detection.
● Regression: Predicting a continuous-valued attribute based on input variables.
○ Example: Predicting house prices based on features like size, location, and
age.
● Time Series Analysis: Analyzing time-ordered data to extract meaningful statistics
and characteristics.
○ Example: Forecasting stock prices or weather conditions.
3. Sequential Pattern Mining: Discovering regular sequences of
events or patterns over time.
26.Data Mining Trends
1. Big Data:
● Description: Managing and analyzing large volumes of data that are beyond the
capability of traditional database systems.
● Trend: Leveraging technologies like Hadoop, Spark, and distributed computing for
big data analytics.
2. Cloud Computing:
● Description: Utilizing cloud resources for scalable and flexible data mining
operations.
● Trend: Adoption of cloud platforms (e.g., AWS, Google Cloud, Azure) for data
storage and analytics.
3. Real-Time Data Mining:
● Description: Analyzing data as it is generated to provide immediate insights.
● Trend: Use of streaming data processing frameworks (e.g., Apache Kafka, Apache
Flink).
27.Data Mining Issues
1. Data Quality:
● Description: Ensuring the accuracy, completeness, and consistency of data.
● Issue: Handling noisy, incomplete, and inconsistent data that can affect the results of
data mining.
2. Scalability:
● Description: Efficiently processing and analyzing large datasets.
● Issue: Developing algorithms that can scale with the increasing volume and
complexity of data.
3. Data Integration:
● Description: Combining data from various heterogeneous sources into a unified
dataset.
● Issue: Addressing challenges related to data format, schema integration, and
semantic consistency.
4. Privacy Concerns:
● Description: Protecting sensitive data from unauthorized access and misuse.
● Issue: Ensuring that data mining practices comply with data protection regulations
(e.g., GDPR).
29.How Association Rule Learning Works:
1. Identify Frequent Itemsets: Find all sets of items (itemsets) that have support
above a certain threshold.
2. Generate Association Rules: From the frequent itemsets, generate rules that have
confidence above a certain threshold.
3. Evaluate and Prune: Evaluate the generated rules using metrics like lift and prune
the ones that are not interesting.
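A minimal sketch of the support, confidence, and lift metrics used in these steps, over a hypothetical five-transaction basket list (item names are illustrative):

```python
# Support, confidence and lift for a candidate rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"bread"}, {"butter"}
rule_support = support(antecedent | consequent)          # 3/5 = 0.60
confidence   = rule_support / support(antecedent)        # 0.60 / 0.80 = 0.75
lift         = confidence / support(consequent)          # 0.75 / 0.80 ≈ 0.94
print(rule_support, confidence, lift)
```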
30.Apriori Algorithm: The Apriori algorithm is a classic algorithm used to find frequent
itemsets and generate association rules. It uses a bottom-up approach where frequent
subsets are extended one item at a time (known as candidate generation), and groups of
candidates are tested against the data.
31.Steps of the Apriori Algorithm:
1. Generate Candidate Itemsets: Start with itemsets of length 1. Generate larger
itemsets by combining the smaller itemsets.
2. Calculate Support: For each candidate itemset, calculate its support.
3. Prune: Remove itemsets that do not meet the minimum support threshold.
4. Repeat: Repeat the process to generate itemsets of increasing length until no more
frequent itemsets are found.
5. Generate Rules: From the frequent itemsets, generate rules and calculate their
confidence.
Example:
● Dataset: {1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {2, 3}
● Minimum Support Threshold: 0.4 (2 out of 5 transactions)
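A from-scratch Apriori sketch for the dataset above (minimum support count 2). Candidate generation here is simplified: it joins frequent k-itemsets but skips the subset-pruning step of the full algorithm.

```python
# Simplified Apriori over the example dataset (min support = 2 of 5 transactions).
transactions = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {2, 3}]
min_count = 2

def frequent_itemsets(transactions, min_count):
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    candidates = [frozenset([i]) for i in items]          # length-1 candidates
    while candidates:
        # Count support for every candidate of size k.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}   # prune infrequent itemsets
        frequent.update(level)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        keys = list(level)
        candidates = list({a | b for a in keys for b in keys if len(a | b) == k + 1})
        k += 1
    return frequent

for itemset, count in frequent_itemsets(transactions, min_count).items():
    print(sorted(itemset), "support =", count / len(transactions))
```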
32.FP-Growth Algorithm
FP-Growth Algorithm: The FP-Growth (Frequent Pattern Growth) algorithm is an
alternative to the Apriori algorithm that eliminates the need for candidate generation. It uses
a divide-and-conquer strategy by constructing a compact data structure called the FP-tree
(Frequent Pattern Tree) and then extracting frequent itemsets directly from this tree.
Steps of the FP-Growth Algorithm:
1. Build the FP-Tree:
○ Scan the database to determine the frequency of each item.
○ Order the items by frequency and construct the FP-tree by inserting
transactions.
2. Mine the FP-Tree:
○ Starting from the root, extract conditional patterns and generate frequent
itemsets.
Example:
● Dataset: {A, B, C}, {A, B}, {B, C}, {A, C}, {B, C}
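A minimal sketch of step 1 (building the FP-tree) for the dataset above with a support count of 2; step 2 (mining conditional patterns) is omitted for brevity. The FPNode class and header-table layout are simplifications assumed for this illustration.

```python
from collections import Counter, defaultdict

class FPNode:
    """One FP-tree node: an item, its count, and links to its parent and children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 1, parent, {}

def build_fp_tree(transactions, min_count):
    # First scan: keep only items that meet the minimum support count.
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root, header = FPNode(None, None), defaultdict(list)      # header table: item -> its nodes
    # Second scan: insert each transaction with items ordered by descending frequency.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

transactions = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}, {"B", "C"}]
root, header = build_fp_tree(transactions, min_count=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})  # item supports
```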
33.Applications of Association Rule Learning:
1. Market Basket Analysis: Identifying products frequently bought together.
2. Cross-Selling: Recommending additional products based on customer purchases.
3. Fraud Detection: Detecting unusual patterns that may indicate fraudulent activity.
4. Healthcare: Discovering associations between symptoms and diseases.
5. Web Usage Mining: Understanding user navigation patterns on websites.
34.Unsupervised Learning: Unsupervised learning is a type of machine learning where the
algorithm is trained on unlabeled data, meaning there are no predefined labels or outcomes.
The goal is to infer the natural structure present within a set of data points. The most
common tasks in unsupervised learning are clustering and association.
35.Clustering Algorithms
1. K-Means Clustering: Partition the dataset into K clusters, where each data point belongs to the cluster with the nearest mean (centroid); a minimal sketch appears after this list.
● Algorithm Steps:
1. Initialize K centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids as the mean of the points in each cluster.
4. Repeat steps 2 and 3 until convergence (centroids no longer change).
2.K-Medoids Clustering (PAM): Similar to K-Means but uses medoids (representative
points) instead of means to define clusters.
● Algorithm Steps:
1. Initialize K medoids randomly.
2. Assign each data point to the nearest medoid.
3. For each medoid, replace it with a non-medoid point and compute the total
cost of the configuration.
4. If the total cost decreases, adopt the new configuration; otherwise, keep the
existing medoid.
5. Repeat steps 2–4 until the configuration no longer changes (convergence).
3.Hierarchical Clustering: Create a tree of clusters (dendrogram) that illustrates the
arrangement of the clusters produced.
● Types:
1. Agglomerative (bottom-up): Start with each data point as a single cluster and
merge the closest pairs of clusters iteratively.
2. Divisive (top-down): Start with all data points in one cluster and recursively
split them into smaller clusters.
● Algorithm Steps:
1. Compute the distance matrix.
2. Find the closest pair of clusters and merge them.
3. Update the distance matrix to reflect the new cluster.
4. Repeat until all points are in a single cluster or a stopping criterion is met.
4. Graph-Based Clustering:
● Basic Idea: Model the dataset as a graph where each node represents a data point,
and edges represent the similarity between points.
● Algorithm Steps:
1. Construct a similarity graph (e.g., k-nearest neighbor graph).
2. Apply graph partitioning methods to find clusters (e.g., spectral clustering).
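A minimal from-scratch K-Means sketch, referenced from the K-Means item above; the six 2-D points and K = 2 are made up for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Assign points to the nearest centroid, recompute centroids, repeat until stable."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random initial centroids
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):              # step 4: stop at convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids, sep="\n")
```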
36.Cluster Analysis Basics, Cluster Evaluation
Cluster Analysis Basics:
● Objective: Organize a set of objects into clusters such that objects in the same
cluster are more similar to each other than to those in other clusters.
● Applications: Customer segmentation, image segmentation, document clustering,
bioinformatics.
37. Outlier Detection and Analysis
Outlier Detection: Outlier detection is the process of identifying data points that significantly
differ from the rest of the dataset. These points can indicate variability in the data or signal
an abnormal behavior.
Methods of Outlier Detection:
1. Statistical Methods: Assume a distribution for the data (e.g., Gaussian) and identify points that deviate significantly from this distribution (a z-score sketch follows this section).
○ Example: Z-score, Grubbs' test.
2. Distance-Based Methods: Identify points that are far from their neighbors.
○ Example: k-Nearest Neighbors (k-NN) based outlier detection.
3. Density-Based Methods: Identify points in low-density regions as outliers.
○ Example: Local Outlier Factor (LOF).
4. Clustering-Based Methods: Treat points not belonging to any cluster or in very
small clusters as outliers.
○ Example: DBSCAN (Density-Based Spatial Clustering of Applications with
Noise).
Outlier Analysis:
● Applications: Fraud detection, network security, fault detection, data cleaning.
● Challenges: High dimensionality, scalability, mixed-type data, defining what
constitutes an outlier.
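A minimal sketch of the statistical (z-score) method listed above, assuming NumPy; the series and the injected anomaly are made up, and the 2-standard-deviation cutoff is a common but arbitrary choice.

```python
import numpy as np

values = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])   # 25.0 is the injected anomaly
z_scores = (values - values.mean()) / values.std()            # standardize against mean and std
outliers = values[np.abs(z_scores) > 2]                       # flag points beyond the cutoff
print(outliers)                                               # -> [25.]
```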
38.Supervised Learning: Supervised learning is a type of machine learning where
the model is trained on a labeled dataset. Each training example consists of an input
and a corresponding desired output, also known as the label. The goal is to learn a
mapping from inputs to outputs that can be used to predict the labels of new, unseen
data.
39.Classification in Supervised Learning
Classification: Classification is a supervised learning task where the objective is to
categorize input data into predefined classes or categories. The model is trained on a
dataset containing input-output pairs and learns to assign a class label to new instances
based on the input features.
40.Issues Regarding Classification:
1. Overfitting: The model performs exceptionally well on the training data but poorly on
new, unseen data due to its complexity.
2. Underfitting: The model is too simple to capture the underlying patterns in the data,
resulting in poor performance on both training and test data.
3. Imbalanced Data: When the classes in the dataset are not equally represented,
leading to a model biased towards the majority class.
4. Feature Selection: Identifying the most relevant features that contribute to the
prediction, which can improve model performance and reduce complexity.
5. Noise: Presence of irrelevant or erroneous data points that can affect the model's accuracy.
41.Types of Classifiers:
1. Binary Classification:
○ Description: Classifies data into two distinct classes.
○ Examples: Spam vs. non-spam emails, disease present vs. disease absent.
○ Common Algorithms: Logistic Regression, Support Vector Machines (SVM),
Decision Trees, Naïve Bayes.
2. Multiclass Classification:
○ Description: Classifies data into more than two classes.
○ Examples: Classifying types of fruits (e.g., apple, banana, orange),
categorizing news articles into different topics.
○ Common Algorithms: Decision Trees, Random Forest, Neural Networks,
k-Nearest Neighbors (k-NN), Naïve Bayes.
42.Classification Approaches
1. Bayesian Classification - Naïve Bayes:
● Basic Idea: Applies Bayes' theorem with the assumption that features are independent given the class (a runnable sketch follows this section).
● Bayes' Theorem: P(C|X) = P(X|C) · P(C) / P(X), where C is the class and X is the feature vector.
Types:
● Gaussian Naïve Bayes: Assumes features follow a Gaussian distribution.
● Multinomial Naïve Bayes: Used for discrete data (e.g., word counts in text).
● Bernoulli Naïve Bayes: Used for binary/Boolean features.
2. Association-Based Classification:
● Basic Idea: Uses association rules discovered in data to build a classifier.
● Steps:
1. Discover frequent itemsets using algorithms like Apriori.
2. Generate association rules from these itemsets.
3. Build a classifier by selecting rules with high confidence and support.
● Example: If a customer buys bread and butter, they are likely to buy milk.
● Pros: Can handle categorical data, interpretable rules.
● Cons: Can be computationally expensive, especially with a large number of rules.
3. Rule-Based Classifier:
● Basic Idea: Uses a set of "if-then" rules for classification.
● Rule Format: If condition(s) -> then class.
● Rule Generation:
○ Direct Method: Extract rules directly from the data using algorithms like
RIPPER, CN2.
○ Indirect Method: Extract rules from other classifiers like decision trees (e.g.,
C4.5).
● Example: If age < 30 and income = high, then class = “young professional”.
● Pros: Interpretable, flexible, easy to implement.
● Cons: May not handle continuous features well, rule conflict resolution can be
complex.
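A minimal Gaussian Naïve Bayes sketch for the Bayesian approach above, assuming scikit-learn is installed; the [age, income] feature values and the 0/1 class labels are made up.

```python
from sklearn.naive_bayes import GaussianNB

# Made-up numeric features [age, income in thousands] with hypothetical class labels 0/1.
X = [[25, 40], [30, 52], [45, 90], [50, 110], [23, 38], [48, 95]]
y = [0, 0, 1, 1, 0, 1]

model = GaussianNB().fit(X, y)                   # estimates per-class Gaussian likelihoods and priors
print(model.predict([[28, 45], [47, 100]]))      # predicted class labels for new points
print(model.predict_proba([[28, 45]]))           # posterior probabilities from Bayes' theorem
```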
43.Example Classification Approaches
1. Naïve Bayes:
○ Application: Email spam filtering.
○ Description: Calculate probabilities of each email being spam or not based
on word frequencies.
2. Association-Based Classification:
○ Application: Market basket analysis.
○ Description: Use frequent itemsets to predict future purchases based on
current cart contents.
3. Rule-Based Classifier:
○ Application: Customer segmentation.
○ Description: Create rules based on customer attributes to classify them into
segments.
44.Web Mining: Web mining involves extracting useful information and knowledge from web
data, which includes web content, web structure, and web usage data. It can be categorized
into three main types:
i. Web Content Mining: Focuses on extracting useful information from the content of web pages.
ii. Web Structure Mining: Analyzes the structure of hyperlinks within the web to discover patterns and relationships.
iii. Web Usage Mining: Analyzes user interaction data (e.g., web logs) to understand user behavior and improve web services.
45.Mining the Web Page Layout Structure:
● Objective: Understand and extract meaningful information from the arrangement of
elements on a web page, such as headers, paragraphs, images, and links.
● Techniques:
○ DOM Tree Parsing: The Document Object Model (DOM) represents the
structure of a web page. Mining involves parsing the DOM tree to extract
layout information.
○ XPath/CSS Selectors: Used to navigate and extract specific elements from
the web page.
○ Visual Segmentation: Techniques like VIPS (Vision-based Page
Segmentation) segment a web page into visually distinct blocks to understand
the layout and hierarchy.
46.Mining Web Link Structure:
● Objective: Analyze the hyperlinks between web pages to discover relationships,
patterns, and the overall structure of the web.
● Key Concepts:
○ PageRank: An algorithm used by Google Search to rank web pages in their
search engine results. It measures the importance of web pages based on the
number and quality of links to them.
○ HITS Algorithm: Hyperlink-Induced Topic Search identifies two types of web
pages: hubs (pages that link to many other pages) and authorities (pages that
are linked by many hubs).
○ Graph Theory: Representing the web as a graph, with nodes as web pages
and edges as hyperlinks, to apply graph-based algorithms for analysis.
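A minimal power-iteration PageRank sketch over a made-up four-page link graph (a damping factor of 0.85 is the conventional choice); this is an illustration of the idea, not the production algorithm.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-8, max_iter=100):
    """Iteratively compute page importance from the link structure."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1)
    # Row-stochastic transition matrix; pages with no outgoing links jump anywhere uniformly.
    P = np.where(out_degree[:, None] > 0, A / np.maximum(out_degree[:, None], 1), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * P.T @ rank
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

# Toy web of four pages: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 2
links = [[0, 1, 1, 0],
         [0, 0, 1, 0],
         [1, 0, 0, 0],
         [0, 0, 1, 0]]
print(pagerank(links))   # page 2 receives the most links, so it gets the highest score
```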
47.Mining Multimedia Data on the Web:
● Objective: Extract and analyze multimedia content (images, videos, audio) from the
web to derive useful information.
● Techniques:
○ Image Mining: Using techniques like object recognition, image classification,
and clustering to analyze images.
○ Video Mining: Analyzing video content using methods such as scene
detection, keyframe extraction, and activity recognition.
○ Audio Mining: Extracting information from audio content through techniques
like speech recognition, audio classification, and sentiment analysis.
48.Distributed Data Mining (DDM): Distributed Data Mining (DDM) refers to the process
of extracting knowledge and patterns from large datasets distributed across multiple
locations, heterogeneous environments, or decentralised systems. DDM is essential for
handling vast amounts of data generated in various fields such as finance,
telecommunications, healthcare, and e-commerce, where data is often stored in distributed
systems.
49.Automatic Classification of Web Documents:
● Objective: Categorize web documents into predefined classes automatically.
● Techniques:
○ Text Classification Algorithms: Such as Naïve Bayes, Support Vector
Machines (SVM), and Neural Networks.
○ Feature Extraction: Techniques like TF-IDF (Term Frequency-Inverse
Document Frequency) and word embeddings (e.g., Word2Vec, BERT)
to represent the text content.
○ Clustering: Grouping similar documents together using algorithms like
k-means or hierarchical clustering for exploratory analysis.
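A minimal sketch combining TF-IDF features with a Naïve Bayes text classifier, assuming scikit-learn; the tiny corpus of page snippets and the class labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical web-page snippets with predefined classes.
docs = ["football match score highlights", "stock market shares fall",
        "new smartphone release review", "team wins championship final",
        "bank interest rates rise", "laptop processor benchmark"]
labels = ["sports", "finance", "tech", "sports", "finance", "tech"]

# TF-IDF turns text into weighted term features; Naïve Bayes assigns the class.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, labels)
print(classifier.predict(["quarterly shares report", "tournament final result"]))
```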
50.Web Usage Mining:
● Objective: Analyze user interaction data from web logs to understand user
behavior and improve web services.
● Steps:
○ Data Collection: Gathering data from web server logs, browser logs,
user profiles, and cookies.
○ Preprocessing: Cleaning and transforming raw data into a usable
format (e.g., session identification, user identification).
○ Pattern Discovery: Using techniques such as association rule mining,
clustering, and sequential pattern mining to find interesting patterns in
web usage data.
○ Pattern Analysis: Interpreting the discovered patterns to make
informed decisions about website design, content, and marketing
strategies.
Types of Knowledge Discovery in Data Mining
1. Classification:
○ Purpose: Assign items to predefined categories.
○ Example: Email spam detection.
2. Clustering:
○ Purpose: Group similar items together without predefined categories.
○ Example: Customer segmentation based on buying behavior.
3. Association Rule Learning:
○ Purpose: Discover relationships between variables in large datasets.
○ Example: Market basket analysis to find products often bought
together.
Advantages and Disadvantages of a Data Mart
Advantages of a Data Mart:
1. Improved Performance: Data marts are smaller and more focused than data
warehouses, allowing for faster query responses and better performance for
specific departmental needs.
2. Cost-Effective: Implementing a data mart is generally less expensive than a
full-scale data warehouse. They require fewer resources and infrastructure,
making them a cost-effective solution for smaller projects or departments.
Disadvantages of a Data Mart:
1. Data Silos: Implementing multiple data marts can lead to the creation of data
silos, where data is isolated and not easily shared or integrated across the
organization. This can hinder overall data analysis and decision-making.
2. Inconsistency: Different data marts might use different standards and
definitions, leading to inconsistencies in data interpretation and reporting
across the organization.
Applications of Data Mining
1. Retail and E-commerce:
○ Customer Segmentation: Identify customer groups based on
purchasing behavior to improve marketing strategies.
○ Market Basket Analysis: Determine products frequently bought
together to enhance cross-selling and product placement.
2. Healthcare:
○ Disease Prediction and Diagnosis: Analyze patient data to predict
diseases and improve early diagnosis.
○ Treatment Effectiveness: Evaluate the success of treatments by
analyzing patient outcomes.
3. Finance and Banking:
○ Fraud Detection: Identify unusual transaction patterns that indicate
potential fraud.
○ Risk Management: Assess credit risks by analyzing customer financial
data and payment histories.
Data Warehouse Three-Tier Architecture
Ans:A data warehouse employs a three-tier architecture to efficiently manage data
processing, storage, and access. This architecture consists of the bottom tier, middle
tier, and top tier.
1. Bottom Tier: Data Source Layer
Function: Extracts data from various source systems and prepares it for storage.
Components:
● Data Sources: Operational databases, ERP systems, flat files, and external
sources.
● ETL Processes: Tools that perform Extract, Transform, and Load operations
to cleanse, integrate, and aggregate data before loading it into the data
warehouse.
2. Middle Tier: Data Storage and Management Layer
Function: Stores and manages cleaned and transformed data, supporting efficient
querying and analysis.
Components:
● Data Warehouse Database: Central repository optimized for read-intensive
operations.
● Data Marts: Subsets of the data warehouse tailored for specific departments
or business units.
● OLAP Servers: Online Analytical Processing servers that support complex
queries and multidimensional analysis.
3. Top Tier: Presentation and Analysis Layer
Function: Provides tools for data reporting, analysis, and visualization, enabling
end-users to derive insights from the data warehouse.
Components:
● Query and Reporting Tools: Allow generation of standard and ad-hoc
reports.
● Data Mining Tools: Discover patterns and relationships through statistical
analysis and machine learning.
● Dashboards and Visualization Tools: Offer graphical representations of
data through charts and dashboards for easier interpretation.
Difference between Data Mining and Data Warehousing
Data Mining:
i. Data mining is the process of determining data patterns.
ii. It is generally considered the process of extracting useful information from a large set of data.
iii. Business entrepreneurs carry out data mining with the help of engineers.
iv. In data mining, data is analyzed repeatedly.
v. Data mining uses pattern-recognition techniques to identify patterns.
Data Warehousing:
i. A data warehouse is a database system designed for analytics.
ii. Data warehousing is the process of combining all the relevant data.
iii. Data warehousing is carried out entirely by engineers.
iv. In data warehousing, data is stored periodically.
v. Data warehousing is the process of extracting and storing data to allow easier reporting.
Feature of good cluster
A good cluster in data clustering exhibits several key features:
1. High Intra-cluster Similarity:Instances within the same cluster should be
similar to each other. This means that the distance or similarity measure
between data points within a cluster should be minimized.
2. Low Inter-cluster Similarity:Instances from different clusters should be
dissimilar. This implies that the distance or dissimilarity measure between
clusters should be maximized.
3. Compactness:Clusters should be tightly packed, meaning that data points
within a cluster should be close to each other. This ensures that the cluster
represents a distinct group.
Pre-pruning and Post-pruning Approaches in Classification
Ans:Prepruning: Prepruning involves stopping the tree construction process early,
before it becomes fully grown, based on certain conditions.
Purpose: It prevents the tree from becoming overly complex and capturing noise in
the training data, thus improving its ability to generalize to unseen data.
Example: Setting a maximum depth limit for the tree, limiting the number of leaf
nodes, or requiring a minimum number of instances in a node before further splitting.
Post-pruning: Post-pruning involves constructing the full decision tree and then
removing or collapsing certain nodes or branches based on pruning criteria.
Purpose: It allows the tree to grow fully and capture all patterns in the training data,
and then simplifies it to improve its performance on unseen data.
Example: Using techniques like reduced-error pruning or cost-complexity pruning.
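A minimal sketch contrasting the two approaches with scikit-learn decision trees (an assumption of this write-up): max_depth and min_samples_leaf act as pre-pruning limits, while ccp_alpha triggers cost-complexity post-pruning of the fully grown tree. The parameter values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth and leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow the tree fully, then simplify via cost-complexity pruning (ccp_alpha).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))
```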
Association algorithm in data mining
Ans: In data mining, an association algorithm is a technique used to discover interesting relationships or associations among a large set of data items. It is commonly applied in market basket analysis to uncover patterns in consumer behavior.
1. Definition : An association algorithm is a computational method used to
uncover patterns of association or co-occurrence among a set of items in a
large dataset.
2. Purpose: It's primarily used for market basket analysis to identify
relationships between items purchased together, which helps in
understanding customer behavior, optimizing product placement, and
designing targeted marketing strategies.
3. Popular algorithms : Common association algorithms include Apriori,
FP-Growth, and Eclat. These algorithms employ different strategies to
efficiently mine associations from large transactional datasets, such as using
candidate generation and pruning techniques.