Lecture 1 - Data Mining 101

This document provides an overview of data mining concepts from a lecture. It defines data mining as the process of discovering patterns in large amounts of data. It describes the typical steps in the knowledge discovery process including data cleaning, integration, selection, transformation, mining, evaluation, and presentation. It outlines the different types of data that can be mined, including database, data warehouse, transactional, and other structured and unstructured data. It also discusses the various types of patterns that can be mined, such as frequent patterns, associations, correlations, classification models, clustering, and outliers. Finally, it briefly introduces some common data mining technologies like statistics, machine learning, and database and data warehouse systems.

Uploaded by

Reymar Ventura
Lecture 1 – Data Mining 101

Department of Computing and Information Sciences


Outline
• What is Data Mining?
• What Kinds of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• Data Mining Technologies
• Data Mining Applications
• Major Issues in Data Mining
What is Data Mining?

• Data mining is the process of discovering interesting patterns and knowledge
from large amounts of data.
• The data sources can include databases, data warehouses, the web, other
information repositories, or data that are streamed into the system dynamically.

Han, J., Kamber, M., and Pei, J. (2012). Data Mining Concepts and Techniques 3rd Edition. Elsevier.
• The Knowledge Discovery Process is an iterative
sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be
combined)
3. Data selection (where data relevant to the analysis
task are retrieved from the database)
4. Data transformation (where data are transformed and
consolidated into forms appropriate for mining by
performing summary or aggregation operations)
5. Data mining (an essential process where intelligent
methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on
interestingness measures)
7. Knowledge presentation (where visualization and
knowledge representation techniques are used to
present mined knowledge to users)
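
The seven steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two in-memory sources (store_a, store_b), the function names, and the toy notion of an "interesting" pattern (an item bought at least twice) are hypothetical and do not come from the lecture.

```python
from collections import Counter

# Hypothetical raw sources: (customer, item) records; None marks noisy data.
store_a = [("ann", "milk"), ("bob", None), ("ann", "bread")]
store_b = [("cat", "milk"), ("bob", "Milk "), ("cat", "eggs")]

def clean(records):
    # 1. Data cleaning: drop records with missing items, normalize the text.
    return [(c, i.strip().lower()) for c, i in records if i is not None]

def integrate(*sources):
    # 2. Data integration: combine multiple sources into one list.
    return [r for source in sources for r in source]

def select(records):
    # 3. Data selection: keep only the attribute relevant to the task (item).
    return [item for _, item in records]

def transform(items):
    # 4. Data transformation: aggregate the items into counts.
    return Counter(items)

def mine(counts, min_count=2):
    # 5. Data mining: extract items that occur at least min_count times.
    return {item: n for item, n in counts.items() if n >= min_count}

def evaluate(patterns, total):
    # 6. Pattern evaluation: score each pattern by its relative frequency.
    return {item: n / total for item, n in patterns.items()}

def present(scored):
    # 7. Knowledge presentation: format the results for a reader.
    return [f"{item}: {share:.0%} of purchases" for item, share in sorted(scored.items())]

records = integrate(clean(store_a), clean(store_b))
items = select(records)
report = present(evaluate(mine(transform(items)), total=len(items)))
print(report)
```

In practice the process is iterative: an unsatisfying report would typically send the analyst back to an earlier step, for example to clean or select the data differently.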
What Kinds of Data Can Be Mined?
• As a general technology, data mining can be applied to any kind of data as
long as the data are meaningful for a target application.
• Basic forms of data:
• Database Data
• Data Warehouse Data
• Transactional Data
• Other forms of data:
• Data Streams
• Ordered/Sequence Data
• Graph or Networked Data
• Spatial data
• Text Data
• Multimedia Data
• Web Data
• Database Data
• A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to
manage and access the data.
• A relational database is a collection of tables, each of which is assigned a unique name. Each
table consists of a set of attributes (columns or fields) and usually stores a large set of tuples
(records or rows).
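
A relational table can be illustrated with Python's built-in sqlite3 module. The table name (customer), its attributes, and the rows are hypothetical examples, not data from the lecture.

```python
import sqlite3

# An in-memory relational database with one uniquely named table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)

# Each inserted tuple is one record (row) of the table.
conn.executemany(
    "INSERT INTO customer (cust_id, name, age) VALUES (?, ?, ?)",
    [(1, "Ann", 34), (2, "Bob", 27), (3, "Cat", 45)],
)

# The attributes (columns) of the table, read from the schema.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]

# A query retrieves the tuples that satisfy a condition.
rows = conn.execute(
    "SELECT name, age FROM customer WHERE age > 30 ORDER BY cust_id"
).fetchall()

print(columns)
print(rows)
conn.close()
```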

• Data Warehouses
• A data warehouse is a repository of information collected from multiple sources, stored
under a unified schema, and usually residing at a single site.
• Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
• A data warehouse is usually modeled by a multidimensional data structure, called a data
cube, in which each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure.
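
A data cube can be approximated by a dictionary whose keys are combinations of dimension values and whose values are an aggregate measure. The sketch below assumes hypothetical sales records with three dimensions (city, quarter, item type) and uses the sum of sales as the measure.

```python
from collections import defaultdict

# Hypothetical fact records: (city, quarter, item_type, sales_amount).
sales = [
    ("Manila", "Q1", "computer", 400),
    ("Manila", "Q1", "phone", 250),
    ("Manila", "Q2", "computer", 300),
    ("Cebu", "Q1", "computer", 150),
    ("Cebu", "Q2", "phone", 200),
]

def build_cube(records, dims):
    """Aggregate sales over the chosen dimensions (indices into each record)."""
    cube = defaultdict(int)
    for record in records:
        cell = tuple(record[d] for d in dims)  # one cell per dimension combination
        cube[cell] += record[3]                # aggregate measure: sum of sales
    return dict(cube)

# Two views of the same data at different levels of aggregation.
by_city_quarter = build_cube(sales, dims=(0, 1))
by_city = build_cube(sales, dims=(0,))

print(by_city_quarter)
print(by_city)
```

Aggregating over fewer dimensions, as by_city does, summarizes the data at a coarser level.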

[Figures: Typical Data Warehouse Framework; Multidimensional Data Cube]

• Transactional Data
• Each record in a transactional database captures a transaction, such as a customer’s
purchase, a flight booking, or a user’s clicks on a web page.
• A transaction typically includes a unique transaction identity number (trans ID) and a list of
the items making up the transaction, such as the items purchased in the transaction.
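
A transactional database can be pictured as a mapping from each trans ID to the list of items in that transaction. The IDs and items below are hypothetical.

```python
# Hypothetical transactional database: trans_ID -> items in the transaction.
transactions = {
    "T100": ["computer", "software", "printer"],
    "T200": ["computer", "software"],
    "T300": ["printer", "paper"],
    "T400": ["computer"],
}

def transactions_containing(db, item):
    """Return the IDs of all transactions that include the given item."""
    return sorted(tid for tid, items in db.items() if item in items)

print(transactions_containing(transactions, "computer"))
```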

What Kinds of Patterns Can Be Mined?
• Data Characterization
• Data characterization is a summarization of the general characteristics or features of a target class of
data.
• The output of data characterization can be presented using pie charts, bar charts, curves,
multidimensional data cubes, and multidimensional tables, including crosstabs.
• The resulting descriptions can also be presented as generalized relations or in rule form (called
characteristic rules).
• Data Discrimination
• Data discrimination is a comparison of the general features of the target class data objects against the
general features of objects from one or multiple contrasting classes.
• The target and contrasting classes can be specified by a user, and the corresponding data objects can be
retrieved through database queries.
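
As a rough sketch, characterization can be read as summarizing one class and discrimination as comparing two such summaries. The customer records, the spending threshold that defines the target class, and the use of attribute means as the summary are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical customer records.
customers = [
    {"age": 42, "visits": 10, "spend": 1500},
    {"age": 38, "visits": 8, "spend": 1100},
    {"age": 25, "visits": 2, "spend": 200},
    {"age": 31, "visits": 4, "spend": 400},
]

# Target class: big spenders. Contrasting class: everyone else.
target = [c for c in customers if c["spend"] >= 1000]
contrast = [c for c in customers if c["spend"] < 1000]

def characterize(rows, attributes=("age", "visits")):
    """Summarize a class by the mean of each attribute."""
    return {a: mean(r[a] for r in rows) for a in attributes}

def discriminate(target_rows, contrast_rows):
    """Compare two classes by the difference of their summaries."""
    t, c = characterize(target_rows), characterize(contrast_rows)
    return {a: t[a] - c[a] for a in t}

print(characterize(target))
print(discriminate(target, contrast))
```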

• Frequent Patterns, Associations and Correlations
• Frequent patterns, as the name suggests, are patterns that occur frequently in data.
• There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences (also
known as sequential patterns), and frequent substructures.

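• An association rule mined from a transactional database might look like this:
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer
buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of
all the transactions under analysis show that computer and software are purchased together.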
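
Support and confidence for the rule "a customer who buys a computer also buys software" can be computed directly from a set of transactions. The four transactions below are hypothetical, so the resulting support (25%) differs from the 1% quoted above, while the confidence happens to be 50%.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"paper", "printer"},
]

def support(db, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(db, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction also containing consequent."""
    return support(db, antecedent | consequent) / support(db, antecedent)

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```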

• Factors (Predictive Analysis)
• Classification is the process of finding a model (or function) that describes and distinguishes data
classes or concepts.
• The model is derived based on the analysis of a set of training data (i.e., data objects for which the
class labels are known).
• The model is used to predict the class label of objects for which the class label is unknown.
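
One very simple classifier is the 1-nearest-neighbor rule, which assigns a new object the label of the closest training object. The lecture does not name a specific method; this technique and the toy training data (age, income in thousands) are assumptions chosen for brevity.

```python
import math

# Hypothetical training data: (age, income in thousands) -> known class label.
training = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
]

def predict(point, labeled):
    """Return the label of the closest training object (1-nearest neighbor)."""
    nearest = min(labeled, key=lambda pair: math.dist(point, pair[0]))
    return nearest[1]

# Predict labels for objects whose class label is unknown.
print(predict((48, 85), training))
print(predict((27, 32), training))
```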

• Regression models continuous-valued functions.
• This is used to predict missing or unavailable numerical data values rather than (discrete) class labels.
• The term prediction refers to both numeric prediction and class label prediction.
• Regression analysis is a statistical methodology that is most often used for numeric prediction, although
other methods exist as well.
• Regression also encompasses the identification of distribution trends based on the available data.
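
Numeric prediction can be illustrated with ordinary least-squares fitting of a straight line, one common form of regression analysis. The data points are hypothetical and lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)

# Predict the unavailable numeric value at x = 5.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```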

• Groups (Cluster Analysis)
• Unlike classification and regression, which
analyze class-labeled (training) data sets,
clustering analyzes data objects without
consulting class labels.
• In many cases, class labeled data may simply
not exist at the beginning. Clustering can be
used to generate class labels for a group of
data.
• The objects are clustered or grouped based on
the principle of maximizing the intraclass
similarity and minimizing the interclass
similarity.
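
A minimal k-means sketch on one-dimensional data shows how objects can be grouped without class labels. K-means is one common clustering method, chosen here as an assumption; the values and the initial centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group values around the given initial centers using k-means."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

values = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(values, centers=[1, 12])
print(centers, clusters)
```

The cluster index of each object can then serve as a generated class label.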

• Abnormalities (Outlier Analysis)
• A data set may contain objects that do not
comply with the general behavior or model of
the data.
• These data objects are outliers. Many data
mining methods discard outliers as noise or
exceptions.
• However, in some applications (e.g., fraud
detection) the rare events can be more
interesting than the more regularly occurring
ones.
• The analysis of outlier data is referred to as
outlier analysis or anomaly mining.
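
A simple way to flag outliers is to measure how many standard deviations each value lies from the mean (a z-score). The transaction amounts and the threshold of 2 are assumptions for illustration. Real fraud detection generally relies on more robust methods.

```python
from statistics import mean, pstdev

# Hypothetical transaction amounts; one is far from the rest.
amounts = [50, 52, 49, 51, 48, 50, 53, 47, 500]

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

outliers = find_outliers(amounts)
print(outliers)
```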

Data Mining Technologies
• Statistics
• Statistics studies the collection, analysis,
interpretation or explanation, and
presentation of data.
• Data mining has an inherent connection with
statistics.
• A statistical model is a set of mathematical
functions that describe the behavior of the
objects in a target class in terms of random
variables and their associated probability
distributions.
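
As a small illustration of a statistical model, the sketch below fits a normal distribution to one attribute using NormalDist from Python's standard statistics module. The observations and the choice of a normal distribution are assumptions, not part of the lecture.

```python
from statistics import NormalDist

# Hypothetical observations of one attribute for objects in a target class.
daily_visits = [2, 4, 4, 4, 5, 5, 7, 9]

# Fit a normal distribution: the mean and standard deviation are
# estimated from the samples.
model = NormalDist.from_samples(daily_visits)

# The fitted model can answer probabilistic questions about new objects,
# such as the probability of observing a value of at most 5.
p_at_most_5 = model.cdf(5)
print(round(model.mean, 2), round(model.stdev, 2), round(p_at_most_5, 2))
```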

• Machine Learning
• Machine learning investigates how computers
can learn (or improve their performance) based
on data.
• A main research area is for computer programs
to automatically learn to recognize complex
patterns and make intelligent decisions based
on data.

• Database System and Data Warehouse
• Database systems research focuses on the
creation, maintenance, and use of databases
for organizations and end-users.
• Database systems researchers have established
highly recognized principles in data models,
query languages, query processing and
optimization methods, data storage, and
indexing and accessing methods.
• Recent database systems have built systematic
data analysis capabilities on database data
using data warehousing and data mining
facilities.
• A data warehouse integrates data originating
from multiple sources and various timeframes.

• Information Retrieval
• Information retrieval (IR) is the science of
searching for documents or information in
documents.
• Documents can be text or multimedia, and
may reside on the Web.
• The differences between traditional
information retrieval and database systems are
twofold:
• the data under search are unstructured;
• the queries are formed mainly by keywords, which do
not have complex structures (unlike SQL queries in
database systems)
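
Keyword search over unstructured text can be sketched with an inverted index, which maps each word to the documents that contain it. The documents, the whitespace tokenization, and the "all keywords must match" rule are simplifying assumptions.

```python
from collections import defaultdict

# Hypothetical unstructured documents.
documents = {
    "d1": "data mining discovers patterns in large data sets",
    "d2": "a data warehouse stores integrated data",
    "d3": "search engines retrieve documents from the web",
}

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return []
    hits = set.intersection(*(index.get(k, set()) for k in keywords))
    return sorted(hits)

index = build_index(documents)
print(search(index, "data mining"))
print(search(index, "data"))
```

Unlike an SQL query, the query here is just a bag of keywords with no structure beyond the words themselves.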

Data Mining Applications
• Where there are data, there are data mining applications.
• Business intelligence (BI) technologies provide historical, current, and predictive views of business
operations.
• A Web search engine is a specialized computer server that searches for information on the Web. The search
results of a user query are often returned as a list (sometimes called hits).
• In Financial Analysis, the banking and finance industry relies on high-quality, reliable data. In loan markets,
financial and user data can be used for a variety of purposes, like predicting loan payments and determining
credit ratings.
• Network resources can face threats and actions that intrude on their confidentiality or integrity. Therefore,
detection of intrusion (Intrusion Detection) has emerged as a crucial data mining practice.
• Biological data mining practices are common in genomics, proteomics, and biomedical research. From
characterizing patients’ behavior and predicting office visits to identifying medical therapies for their
illnesses, data science techniques provide multiple advantages.

Major Issues in Data Mining
• Mining Methodology
• Researchers have been vigorously developing new data mining methodologies. This involves the investigation of new kinds of
knowledge, mining in multidimensional space, integrating methods from other disciplines, and the consideration of semantic ties
among data objects.

• User Interaction
• The user plays an important role in the data mining process. Interesting areas of research include how to interact with a data mining
system, how to incorporate a user’s background knowledge in mining, and how to visualize and comprehend data mining results.

• Efficiency and Scalability
• Efficiency and scalability are always considered when comparing data mining algorithms. As data amounts continue to multiply, these
two factors are especially critical.

• Diversity of Database Types
• The wide diversity of database types brings about challenges to data mining.

• Data Mining and Society
• With data mining penetrating our everyday lives, it is important to study the impact of data mining on society.
• Data mining will help scientific discovery, business management, economic recovery, and security protection (e.g., the real-time
discovery of intruders and cyberattacks). However, it also poses the risk of disclosing an individual’s personal information, which
motivates research on privacy-preserving data mining.