Module1 1 Introduction

The document outlines a syllabus for a course on Data Mining and Warehousing, covering topics such as data preprocessing, association rules, classification, clustering, and data mining functionalities. It emphasizes the importance of data mining in extracting knowledge from large datasets and includes references to core texts and additional resources. The document also details the stages of the knowledge discovery process and various types of data that can be mined.

Uploaded by

donmathew666666

CP1444: DATA MINING & WAREHOUSING
SYLLABUS
Module I:
Introduction: Data, Information, Knowledge, KDD, types of data for mining,
technologies for mining, issues in data mining, data mining functionalities/tasks. Data
pre-processing: overview, Data cleaning, Data integration, Data reduction, Data transformation and
discretization. Data Warehouses: basic concepts, Data Mart, Databases vs. Data Warehouses, Data
Warehouses vs. Data Marts, OLTP vs. OLAP, OLAP operations/functions, OLAP Multi-Dimensional
Models: Data cubes, Star, Snowflake, and Fact constellation data models.
Module II:
Association rules: Market Basket Analysis, Frequent Itemsets, Closed Itemsets, and Association
Rules, Frequent Itemset Mining Methods: Apriori Algorithm: Finding Frequent Itemsets by
Confined Candidate Generation, Generating Association Rules from Frequent Itemsets, Improving
the Efficiency of Apriori.
Module III:
Classification: Basic Concepts, Decision Tree Induction, Bayesian Classification, Rule-Based
Classification, Classification by Backpropagation, Support Vector Machines, Associative Classification,
Lazy Learners.
Module IV:
Clustering: Cluster analysis: definition and requirements, Characteristics of clustering techniques, Types
of data in cluster analysis, Overview of Basic Clustering Methods, Partitioning methods: K-Means and
K-Medoids methods, Outlier detection: definition and types of outliers, Outlier Detection Methods:
Supervised, Semi-Supervised, and Unsupervised Methods, Statistical Methods, Proximity-Based
Methods, and Clustering-Based Methods (basic concepts only).

CORE TEXT

1. Jiawei Han, Micheline Kamber & Jian Pei, Data Mining: Concepts and Techniques,
3rd Edition, Morgan Kaufmann, 2011.

https://myweb.sabanciuniv.edu/rdehkharghani/files/2016/02/The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Kamber-Jian-Pei-Data-Mining.-Concepts-and-Techniques-3rd-Edition-Morgan-Kaufmann-2011.pdf

ADDITIONAL REFERENCES

2. Sunitha Tiwari & Neha Chaudary, Data Mining and Warehousing, Dhanpat
Rai & Co.
Reference:
Chapter 1

https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm

WHAT IS DATA MINING?

• To refer to the mining of gold from rocks or sand, we say gold mining
instead of rock or sand mining.
• Analogously, data mining should have been more appropriately named
“knowledge mining from data,” which is unfortunately somewhat long.
• Data mining refers to extracting or mining knowledge from large
amounts of data.

• It is the computational process of discovering patterns in large data sets, involving
methods at the intersection of artificial intelligence, machine learning, statistics, and
database systems.
• The overall goal of the data mining process is to extract information from a data set
and transform it into an understandable structure for further use.
• The key properties of data mining are:
  • Automatic discovery of patterns
  • Prediction of likely outcomes
  • Creation of actionable information
  • Focus on large datasets and databases

Data

• Data is a collection of facts, such as numbers, words, measurements,
observations, or just descriptions of things.
• Data can be qualitative or quantitative.
  • Qualitative data is descriptive information (it describes something).
  • Quantitative data is numerical information (numbers).

Information
• Raw data is collected and processed; the outcome of that processing is
information.
• Information is data that has been processed, organized, and presented in a
specific context so that it serves a purpose.
• Information cannot exist without data, and most information is expressed in
units of measure such as quantity or time. Data and information also differ
in several ways. For information to be useful, the processed data should
have the following characteristics:

• Time – Information should be available whenever it is required.
• Accuracy – Information should be factual and organized; only then can it serve
its purpose.
• Completeness – Information should be finite and consistent.

• Some examples of information:

1. Information about transportation systems such as train schedules.
2. Geographical information such as direction.
3. Payslips
4. Bank passbook
5. Printed documents.

Knowledge
• Knowledge is information that has been processed, organized, or
structured in some way, or put into practice in some way.
• Knowledge means the familiarity and awareness of a person, place,
events, ideas, issues, ways of doing things or anything else, which is
gathered through learning, perceiving or discovering.
• It is the state of knowing something with cognizance through the
understanding of concepts, study and experience.

Why do we need Data Mining?

The volume of information we must handle is increasing every day, coming from business
transactions, scientific data, sensor data, pictures, videos, etc. We therefore need a system
capable of extracting the essence of the available information and of automatically
generating reports, views, or summaries of the data for better decision-making.

Why is Data Mining used in Business?

Data mining is used in business to make better managerial decisions by:

1. Automatic summarization of data.
2. Extracting the essence of the stored information.
3. Discovering patterns in raw data.

KNOWLEDGE DISCOVERY FROM DATA, OR KDD

• KDD (Knowledge Discovery in Databases) is a process that involves
the extraction of useful, previously unknown, and potentially
valuable information from large datasets.

• The KDD process in data mining typically involves the following
steps:

I. Data cleaning (to remove noise and inconsistent data)
II. Data integration (where multiple data sources may be combined)
III. Data selection (where data relevant to the analysis task are retrieved from the database)
IV. Data transformation (where data are transformed and consolidated into forms appropriate
for mining by performing summary or aggregation operations)
V. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
VI. Pattern evaluation (to identify the truly interesting patterns representing knowledge based
on interestingness measures)
VII. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)

Steps I to IV are different forms of data preprocessing.
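
The seven steps can be sketched end to end on a toy example. This is only an illustrative sketch, not a prescribed implementation: the two source lists, the field names (`cust`, `amount`, `region`), the spending threshold, and the minimum count are all hypothetical.

```python
from collections import Counter

# Two hypothetical sources with noisy records (None marks a missing amount).
store_sales = [
    {"cust": "c1", "amount": 120.0, "region": "north"},
    {"cust": "c2", "amount": None, "region": "north"},
    {"cust": "c3", "amount": 40.0, "region": "south"},
]
web_sales = [
    {"cust": "c4", "amount": 300.0, "region": "north"},
    {"cust": "c5", "amount": 35.0, "region": "south"},
]

def clean(records):
    # I. Data cleaning: drop records whose amount is missing.
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    # II. Data integration: combine several sources into one list.
    return [r for source in sources for r in source]

def select(records):
    # III. Data selection: keep only the attributes relevant to the task.
    return [(r["region"], r["amount"]) for r in records]

def transform(pairs, threshold=100.0):
    # IV. Data transformation: discretize amounts into "high" / "low".
    return [(region, "high" if amount >= threshold else "low")
            for region, amount in pairs]

def mine(pairs):
    # V. Data mining: count how often each (region, level) pattern occurs.
    return Counter(pairs)

def evaluate(patterns, min_count=2):
    # VI. Pattern evaluation: keep patterns that reach a minimum count.
    return {p: n for p, n in patterns.items() if n >= min_count}

data = integrate(clean(store_sales), clean(web_sales))
patterns = evaluate(mine(transform(select(data))))

# VII. Knowledge presentation: print the surviving patterns.
for (region, level), n in sorted(patterns.items()):
    print(f"{region}: {level} spenders seen {n} times")
```

Each function maps to one step, so a step can be swapped out (for example, filling missing amounts instead of dropping them) without touching the others.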

The term data mining is often used to refer to the entire knowledge discovery process.

PYQ: What is data mining? Outline the stages in the knowledge discovery process. (1

1. Selection: Select a relevant subset of the data for analysis.

2. Pre-processing: Clean and transform the data to make it ready for analysis. This may include tasks such
as data normalization, missing value handling, and data integration.

3. Transformation: Transform the data into a format suitable for data mining, such as a matrix or a graph.

4. Data Mining: Apply data mining techniques and algorithms to the data to extract useful information and
insights. This may include tasks such as clustering, classification, association rule mining, and anomaly
detection.

5. Interpretation: Interpret the results and extract knowledge from the data. This may include tasks such as
visualizing the results, evaluating the quality of the discovered patterns and identifying relationships and
associations among the data.

6. Evaluation: Evaluate the results to ensure that the extracted knowledge is useful, accurate, and
meaningful.

7. Deployment: Use the discovered knowledge to solve the business problem and make decisions.

WHAT KINDS OF DATA CAN BE MINED?

• The most basic forms of data for mining applications are database
data, data warehouse data and transactional data.

• Data mining can also be applied to other forms of data (e.g., data
streams, ordered/sequence data, graph or networked data, spatial
data, text data, multimedia data, and the WWW).

I. Database Data
• A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to
manage and access the data.
• A relational database is a collection of tables, each of which is assigned a unique name.
Each table consists of a set of attributes (columns or fields) and usually stores a large set
of tuples (records or rows).
• When mining relational databases, we can go further by searching for trends or data
patterns. For example, data mining systems can analyze customer data to predict the
credit risk of new customers based on their income, age, and previous credit
information.

II. Data Warehouses

• A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site.
• Data warehouses are constructed via a process of data cleaning, data integration,
data transformation, data loading, and periodic data refreshing.
• A data warehouse is usually modeled by a multidimensional data structure, called a
data cube, in which each dimension corresponds to an attribute or a set of attributes
in the schema, and each cell stores the value of some aggregate measure such as
count or sum(sales_amount).
• A data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.
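
As a rough sketch of the idea, the cells of a small cube can be precomputed in a dictionary keyed by dimension values. The sales records, the dimension names (item, city, quarter), and the use of `"*"` to mean "all values of this dimension" are assumptions made for this illustration only.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact records: (item, city, quarter, sales_amount).
sales = [
    ("laptop", "Delhi", "Q1", 1000),
    ("laptop", "Delhi", "Q2", 1500),
    ("printer", "Delhi", "Q1", 200),
    ("laptop", "Kochi", "Q1", 800),
]

def build_cube(records):
    """Precompute sum(sales_amount) for every combination of dimensions.

    A "*" in a key position means that dimension is aggregated away.
    """
    cube = defaultdict(int)
    for item, city, quarter, amount in records:
        # Each record contributes to 2^3 = 8 cells: for every dimension,
        # either its concrete value or the "*" (all) placeholder.
        for key in product((item, "*"), (city, "*"), (quarter, "*")):
            cube[key] += amount
    return cube

cube = build_cube(sales)

print(cube[("laptop", "Delhi", "Q1")])  # a single base cell
print(cube[("laptop", "*", "*")])       # all laptop sales
print(cube[("*", "Delhi", "Q1")])       # all Delhi sales in Q1
print(cube[("*", "*", "*")])            # grand total
```

Because every aggregate is stored ahead of time, a summary query becomes a single dictionary lookup, which is the "precomputation and fast access" trade-off: more storage in exchange for faster answers.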

III. Transactional Data

• Each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking,
or a user’s clicks on a web page.
• A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up
the transaction, such as the items purchased in the transaction.

IV. Other Kinds of Data


• Time-related or sequence data (e.g., historical records, stock exchange data, and time-series and biological
sequence data),
• data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
• spatial data (e.g., maps),
• engineering design data (e.g., the design of buildings, system components, or integrated circuits),
• hypertext and multimedia data (including text, image, video, and audio data),
• graph and networked data (e.g., social and information networks), and
• the Web (a huge, widely distributed information repository made available by the Internet).

WHAT KINDS OF PATTERNS CAN BE MINED?
DATA MINING FUNCTIONALITIES (ESSAY)
1. characterization and discrimination
2. the mining of frequent patterns, associations, and correlations
3. classification and regression
4. clustering analysis and
5. outlier analysis

• Data mining functionalities are used to specify the kinds of patterns to be found in data
mining tasks.
• In general, such tasks can be classified into two categories: descriptive and predictive.
• Descriptive mining tasks characterize properties of the data in a target data set.
• Predictive mining tasks perform induction on the current data in order to make predictions.

DATA MINING FUNCTIONALITIES

I. Class/Concept Description: Characterization and Discrimination

• Data entries can be associated with classes or concepts.
• For example, classes of items for sale include computers and printers, and concepts
of customers include bigSpenders and budgetSpenders.
• It can be useful to describe individual classes and concepts in summarized, concise,
and yet precise terms. Such descriptions of a class or a concept are called
class/concept descriptions.
• These descriptions can be derived using:
  • Data Characterization − summarizing the data of the class under study, which is called the
  target class.
  • Data Discrimination − comparing the general features of the target class with those of one or
  more contrasting classes.
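
A minimal sketch of the two operations, assuming a hypothetical list of customer records labelled `bigSpender` or `budgetSpender` with made-up `age` and `income` attributes. Here characterization is taken to mean summarizing one class by attribute averages, and discrimination is taken to mean contrasting that summary with the summary of another class; real systems use richer summaries.

```python
from statistics import mean

# Hypothetical labelled customer records (income in thousands).
customers = [
    {"cls": "bigSpender", "age": 45, "income": 90},
    {"cls": "bigSpender", "age": 50, "income": 110},
    {"cls": "budgetSpender", "age": 25, "income": 30},
    {"cls": "budgetSpender", "age": 35, "income": 40},
]

def characterize(records, target):
    """Summarize the target class by the mean of each numeric attribute."""
    rows = [r for r in records if r["cls"] == target]
    return {
        "age": mean(r["age"] for r in rows),
        "income": mean(r["income"] for r in rows),
    }

def discriminate(records, target, contrast):
    """Contrast the target class with another class, attribute by attribute."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {attr: t[attr] - c[attr] for attr in t}

print(characterize(customers, "bigSpender"))
print(discriminate(customers, "bigSpender", "budgetSpender"))
```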

II. Mining Frequent Patterns, Associations, and Correlations


• Frequent patterns are patterns that occur frequently in data.
• There are many kinds of frequent patterns, including:
  • frequent itemsets
  • frequent subsequences
  • frequent substructures
• A frequent itemset typically refers to a set of items that often appear together in a transactional
data set, for example milk and bread, which are frequently bought together in grocery stores by
many customers.
• A frequently occurring subsequence, such as the pattern that customers tend to purchase first a
laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
• A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be
combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.
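
To make "frequent itemset" concrete, the sketch below counts how often each pair of items appears together and keeps the pairs whose support (the fraction of transactions containing them) reaches a minimum. The transactions and the 0.5 threshold are made up, and the code enumerates candidates by brute force; it is not the Apriori algorithm covered in Module II.

```python
from collections import Counter
from itertools import combinations

# Hypothetical grocery transactions, one set of items per transaction.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return {itemset: support} for itemsets of the given size."""
    counts = Counter()
    for t in transactions:
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

pairs = frequent_itemsets(transactions, size=2, min_support=0.5)
for itemset, support in sorted(pairs.items()):
    print(itemset, support)
```

With this data, bread and milk appear together in three of the four transactions, so that pair has a support of 0.75.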

III. Classification and Regression for Predictive Analysis

Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts.

The model is derived based on the analysis of a set of training data (i.e., data objects for which the
class labels are known). The model is used to predict the class label of objects for which the class label
is unknown.

Regression is used to predict missing or unavailable numerical data values
rather than (discrete) class labels.

Regression analysis is a statistical methodology that is most often used for
numeric prediction.
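
The difference between the two tasks can be sketched as follows. The training data, the labels `high_risk` / `low_risk`, and the choice of techniques (a nearest-neighbour rule for classification and a least-squares line for regression) are assumptions made for illustration; many other models fit the same definitions.

```python
from statistics import mean

# Hypothetical training data: (income, age) -> known class label.
training = [
    ((20, 25), "high_risk"),
    ((25, 30), "high_risk"),
    ((80, 45), "low_risk"),
    ((90, 50), "low_risk"),
]

def classify(point):
    """Predict a discrete label: copy the label of the nearest training object."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training, key=lambda pair: sq_dist(pair[0], point))
    return label

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

# Classification: the output is a class label.
print(classify((85, 40)))

# Regression: the output is a number.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope * 5 + intercept)
```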


IV. Cluster Analysis

The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.

That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are rather dissimilar to objects in
other clusters.
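
One way to check this principle on a given grouping is to compare the average distance between objects in the same cluster with the average distance between objects in different clusters. The points and the two clusters below are hypothetical, and Euclidean distance is an assumed (dis)similarity measure; the sketch evaluates a grouping rather than finding one.

```python
from itertools import combinations, product
from math import dist
from statistics import mean

# Hypothetical 2-D objects already assigned to two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)],
}

def avg_intra_distance(clusters):
    """Mean distance between pairs of objects in the same cluster."""
    return mean(dist(p, q)
                for points in clusters.values()
                for p, q in combinations(points, 2))

def avg_inter_distance(clusters):
    """Mean distance between pairs of objects in different clusters."""
    return mean(dist(p, q)
                for a, b in combinations(clusters.values(), 2)
                for p, q in product(a, b))

intra = avg_intra_distance(clusters)
inter = avg_inter_distance(clusters)
print(f"intra-cluster: {intra:.2f}, inter-cluster: {inter:.2f}")
```

A good clustering shows a small intra-cluster distance (high similarity inside clusters) and a large inter-cluster distance (low similarity across clusters).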


V. Outlier Analysis

A data set may contain objects that do not comply with the general behavior
or model of the data.
These data objects are outliers.
Many data mining methods discard outliers as noise or exceptions.
In some applications (e.g., fraud detection), the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data
is referred to as outlier analysis or anomaly mining.

E.g., outlier analysis may uncover fraudulent usage of credit cards by detecting
purchases of unusually large amounts for a given account number in
comparison to regular charges incurred by the same account.
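
The credit card example can be sketched with a simple statistical rule: flag a charge when it lies far from the account's median, measured in units of the median absolute deviation (MAD). The charge amounts, the threshold of 5, and the choice of this particular rule are assumptions for illustration; it is one of many possible outlier detection methods.

```python
from statistics import median

def unusual_charges(charges, threshold=5.0):
    """Return charges that deviate strongly from the account's typical amount."""
    med = median(charges)
    # Median absolute deviation: a spread measure that a single huge
    # charge cannot inflate the way it inflates a standard deviation.
    mad = median(abs(c - med) for c in charges)
    if mad == 0:
        return []  # all charges (nearly) identical; nothing to compare against
    return [c for c in charges if abs(c - med) / mad > threshold]

# Hypothetical charges on one account.
account_charges = [40, 55, 38, 60, 45, 2500]
print(unusual_charges(account_charges))
```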
