
Session 1: Getting to know data

ITEC5310- DATA MINING

Some figures are taken from two textbooks:
Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2012.
Max Bramer, Principles of Data Mining, Springer, 2007.
CONTENTS:
• Part 1: What is Data Mining (DM).
• General concepts.
• Applications of DM.
• Data Mining and the Knowledge Discovery Process (KDP).
• Problems in DM

• Part 2: An overview of your data.
• Dataset and features.
• Measuring data values.
• Data visualization.
• Similarity/dissimilarity (distance) between data points.
Part 1: What is Data Mining?
Why Data Mining?
• “We are living in the information age”
• For example:
• Wal-Mart handles hundreds of millions of transactions per week at thousands of branches around the world.
• The medical and health industry generates tremendous amounts of data
from medical records, patient monitoring, and medical imaging.
• Communities and social media have become increasingly important data
sources, producing digital pictures and videos, blogs, Web communities,
and various kinds of social networks.
• Data mining turns a large collection of data into knowledge.
What is Data Mining?
• Can be defined in different ways!
• Data mining: searching for knowledge (interesting patterns) in data (also called knowledge mining from data).
• DM can also be viewed as merely an essential step in the knowledge discovery process (KDP).
• The KDP has many steps (they vary with the KDP model):
• Problem definition
• Data collection
• Data pre-processing
• Data mining
• …
KDP and DM
1. Data cleaning (to remove noise and
inconsistent data)
2. Data integration (where multiple data
sources may be combined)
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation

Steps 1 through 4 are different forms of data preprocessing.
What kind of data can be mined?
• Database data: a collection of interrelated data and a set of software programs to manage and access the data.
• RDBMS: a collection of tables.
• A table consists of a set of attributes and stores a large set of records (rows).
• A data warehouse: a repository of information collected from multiple
sources, stored under a unified schema, and usually residing at a
single site. Data warehouses are constructed via a process of data
cleaning, data integration, data transformation,...
• Transactional data: a customer’s purchase, a flight booking, a user’s clicks on a web page. A transactional database may have additional tables, which contain other information related to the transactions.
• And other kinds of data?
What kind of data can be mined?
•Other Kinds of Data:
•Time-related or sequence data (historical records, stock exchange data, and time-series and
biological sequence data)
•Data streams (video surveillance and sensor data, which are continuously transmitted)
•Spatial data (e.g., maps)
•Engineering design data (e.g., the design of buildings, system components, or integrated
circuits)
•Hypertext and multimedia data (including text, image, video, and audio data)
•Graph and networked data (e.g., social and information networks)
•Web data (a huge, widely distributed information repository made available by the Internet).
•…
•These kinds of data bring challenges:
•Handling data that carry special structures (e.g., sequences, trees, graphs, and networks)
•Handling specific semantics (such as ordering, image, audio and video contents, and connectivity)
•Mining patterns that carry rich structures and semantics.
[Figures: weather data set, sensors data set, Twitter data set]
Which Technologies Are Used?
• Statistics knowledge
• Descriptive statistics
• Regression, evaluation, …
• Machine learning knowledge.
• Supervised learning, unsupervised learning, …
• Database knowledge.
• Database systems, data warehouses
• Visualization skills
• Algorithms, applications: Weka, OLAP, Python, ...
• Information retrieval skills
• …
What Fields Can Use Data Mining?
• A lot of fields!
• Almost every area of society, economics, health, sport, ...:
• Finance: calculating credit loans, credit limits.
• Insurance: building models for health insurance, accidents, educational investments, ...
• Sports: heat charts, detecting tactics, ...
• Social field: attributing anonymous works, determining the date/time of works, ...
• Social security: detecting serial killers, predicting and recognizing criminals, criminal psychology analysis.
• Business intelligence: detecting trends, personalities, opinions, etc. of individuals and groups of people.
What Kinds of Patterns Can Be Mined?
• Class/Concept Description: Characterization and Discrimination
• Mining Frequent Patterns, Associations, and Correlations
• Classification and Regression for Predictive Analysis
• Cluster Analysis
Part 2: Getting to know data
Data Objects and Attribute Types
• Data sets
• A data set is made up of data objects.
• A data object represents an entity.
• In a sales database: customers, store items, and sales;
• In a medical database: patients;
• In a university database: students, professors, and courses.
• Data objects are typically described by attributes.
• Data objects can also be referred to as samples, examples, instances,
data points, or objects.
• Attributes
• An attribute is a data field, representing a characteristic or feature of a
data object.
• The terms attribute, dimension, feature, and variable are often used interchangeably.
• The type of an attribute is determined by its set of possible values: nominal, binary, ordinal, or numeric.
Type of attribute
• Nominal Attribute
• Means “relating to names”
• The values of a nominal attribute are symbols or names of things.
• Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical. The values do not have any
meaningful order.
• Hair color: {black, brown, blond, red, auburn, gray, white}.
• Marital status: {single, married, divorced, widowed}
• Occupation: { teacher, dentist, programmer, farmer, ….}
• Notice:
• The values have no meaningful order and are not quantitative.
• It makes no sense to find the mean (average) or median (middle) value.
• But the mode, one of the measures of central tendency, can be calculated.
Type of attribute (cont.)
• Binary Attribute
• A kind of nominal attribute with only two categories or states: 0 or 1
• Yes or No.
• 0 or 1.
• Positive or Negative.
• Two kinds of binary attribute:
• Symmetric: both values are equally meaningful and equally important.
• Asymmetric: the two values are not equally important.
• Ex. symmetric binary attribute: Gender (Male/Female).
• Ex. asymmetric binary attribute: HIV test (Positive/Negative), heavy rain expected (Yes/No).
• By convention, the most important outcome, which is usually the rarest one, is coded as 1 (e.g., HIV positive) and the other as 0 (e.g., HIV negative).
Type of attribute (cont.)
• Ordinal Attribute
• An attribute with possible values that have a meaningful order or ranking
among them, but the magnitude between successive values is not known.
• Example:
• Size: very tiny, tiny, small, medium, big, huge.
• Income: low, lower middle, middle, upper middle, high, very high.
• …
• In many cases, ordinal attributes are obtained by discretizing numeric attributes.
• The central tendency of an ordinal attribute can be represented by its mode and
its median (the middle value in an ordered sequence), but the mean cannot be
defined.
• Note that nominal, binary, and ordinal attributes are qualitative. That is, they
describe a feature of an object without giving an actual size or quantity.
Type of attribute (cont.)
• Numeric Attribute
• A measurable quantity, represented in integer or real values.
• Range: Integer/Real.
• Has 2 forms: interval-scaled and ratio-scaled.
• Interval-scaled:
• Measured on a scale of equal-size units.
• The values of interval-scaled attributes have order and can be positive, 0, or negative.
• Allows us to compare and quantify the difference between values.
• Ratio-scaled:
• Has an inherent zero point, so one value can be described as a multiple (ratio) of another.
• We can also compute the difference between values, as well as the mean, median, and mode.
• Example: temperature measured in Kelvin (0 K = -273.15 °C).
Basic Statistical Descriptions of Data
• Measuring the Central Tendency: Mean, Median, and Mode
• Mean: average value.
• Weighted arithmetic mean: an average in which each value is multiplied by a weight.
• The mean is not always the best way of measuring the center of the data, because it is sensitive to extreme (e.g., outlier) values.
• Median: middle value in a set of ordered data values.
• Sort the n values in ascending order.
• n odd: the median is the middle value.
• n even: the median is the average of the two middle values.
• Mode: value that occurs most frequently in the set
• Midrange: average of the largest and smallest values in the set
Basic Statistical Descriptions of Data (cont.)
• Example:
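• Given a set of values: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
• n = 12 (12 numbers, sorted in ascending order)
• Mean: $\bar{x} = \frac{30 + 36 + \dots + 110}{12} = \frac{696}{12} = 58$
• Median: n is even, so the median is the average of the 6th and 7th values: (52 + 56)/2 = 54
• Mode: 2 values, 52 and 70 (each appears 2 times).
• Midrange: (Min + Max)/2 = (30 + 110)/2 = 70.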
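The worked example above can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the variable names are chosen here for illustration.

```python
import statistics

# The 12 values from the slide, already in ascending order.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)              # sum of the values divided by n
median = statistics.median(values)          # n is even: average of the 6th and 7th values
modes = statistics.multimode(values)        # every value that shares the highest frequency
midrange = (min(values) + max(values)) / 2  # average of the smallest and largest values

print("mean:", mean)
print("median:", median)
print("modes:", modes)
print("midrange:", midrange)
```

`multimode` is used instead of `mode` because this data set has two modes.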
Basic Statistical Descriptions of Data (cont.)
• Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range.
• Range: difference between the largest and smallest values (Max - Min).
• Quantiles: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
• The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median.
• The 4-quantiles (quartiles) are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution.
• The 100-quantiles are called percentiles.
• IQR (interquartile range) = Q3 - Q1
Basic Statistical Descriptions of Data (cont.)
• Example:
• Given the set of values: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
• n = 12 (12 numbers, sorted in ascending order)
• Range: 110 - 30 = 80.
• Quartiles (taking the 3rd, 6th, and 9th values): Q1 = 47, Q2 = 52, Q3 = 63.
• IQR = Q3 - Q1 = 63 - 47 = 16.
• Note: Q2 = 52 obtained this way is not the median. Because n = 12 is even, the median is (52 + 56)/2 = 54.
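The quartile example can be reproduced with a short sketch. It assumes the positional convention used on the slide (the values at positions n/4, n/2 and 3n/4 of the sorted list); the helper name `quartiles_by_position` is hypothetical. Statistical libraries often interpolate between neighbouring values instead, so their quartiles may differ slightly.

```python
values = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

def quartiles_by_position(sorted_values):
    """Return (Q1, Q2, Q3) as the values at 1-based positions n/4, n/2, 3n/4.

    Assumes len(sorted_values) is a multiple of 4, as in the slide example.
    """
    n = len(sorted_values)
    return tuple(sorted_values[k * n // 4 - 1] for k in (1, 2, 3))

q1, q2, q3 = quartiles_by_position(values)
data_range = values[-1] - values[0]  # Max - Min
iqr = q3 - q1                        # interquartile range

print("Q1, Q2, Q3:", q1, q2, q3)
print("range:", data_range, "IQR:", iqr)
```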
Basic Statistical Descriptions of Data (cont.)
• The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum, Q1, Median, Q3, Maximum.
• Boxplot: a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows:
• The ends of the box are at the quartiles, so that the box length is the interquartile range.
• The median is marked by a line within the box.
• Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
Basic Statistical Descriptions of Data (cont.)

• Variance & Standard Deviation


• Measures of data dispersion.
• Indicate how spread out a data distribution is.
• A low standard deviation: the data observations tend to be very close to the
mean
• A high standard deviation: the data are spread out over a large range of
values.
• Set of values: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
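A minimal sketch of the variance and standard deviation for this set of values follows. It assumes the population form, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$, which divides by N; the sample form divides by N - 1 and gives a slightly larger result.

```python
import math
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

n = len(values)
mean = sum(values) / n

# Population variance: average squared deviation from the mean.
variance = sum((x - mean) ** 2 for x in values) / n
# Standard deviation: square root of the variance, in the same unit as the data.
std = math.sqrt(variance)

# Cross-check against the standard library's population functions.
assert math.isclose(variance, statistics.pvariance(values))
assert math.isclose(std, statistics.pstdev(values))

print("variance:", round(variance, 2))
print("std:", round(std, 2))
```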
QUIZ

Calculate
• Mean= ??
• Mode= ??
• Median= ??
• MidRange=??
• Range=??
• Variance=?? Std= ??
• 4-Quartiles= ??
• BoxPlot??
Basic Statistical Descriptions of Data (cont.)
• Graphic Displays of Basic Statistical Descriptions of Data
• Quantile Plot:
• A quantile plot is a simple and effective way to have a first look at a univariate
data distribution.
• It displays all of the data for the given attribute (allowing the user to assess both the overall behavior and unusual occurrences).
• It plots quantile information.
Basic Statistical Descriptions of Data (cont.)
• Graphic Displays of Basic Statistical Descriptions of Data
• Quantile–Quantile Plot:
• A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
• It is a powerful visualization tool in that it allows the user to view whether
there is a shift in going from one distribution to another.
Basic Statistical Descriptions of Data (cont.)
• Graphic Displays of Basic Statistical Descriptions of Data
• Histogram.
• Scatter Plot and Data Correlation.
Data Visualization
• An important skill in DM
• Data is easier to understand with charts and figures!
• Plots are limited to 3D space!
• But data sets usually have more than 3 features!
• Many techniques help to get the best visualization:
• Pixel-Oriented Visualization.
• Geometric Projection Visualization.
• Icon-Based Visualization.
• Hierarchical Visualization
Data Visualization (cont.)
• Pixel-Oriented Visualization
Data Visualization (cont.)
• Geometric Projection Visualization
Data Visualization (cont.)
• Geometric Projection Visualization (cont.)
Data Visualization (cont.)
• Icon-Based Visualization
Data Visualization (cont.)
• Icon-Based Visualization (cont.)
Data Visualization (cont.)

• Hierarchical Visualization
• [Figures: Tree-Maps, World-in-World]
Measuring Data Similarity and Dissimilarity
• Some techniques, such as clustering, outlier analysis, and nearest-neighbor classification, need ways to assess how alike or unalike objects are in comparison to one another.
• Clustering based on the K-Means algorithm:
• Gather similar objects into one group; an object in this group is dissimilar to the objects in other groups.
• Detecting abnormal objects (outliers):
• An abnormal object differs clearly from the known objects.
• Classification with the k-NN algorithm:
• Find the K objects nearest to a new object, then predict the label of the new one from them.
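To make the role of distance concrete, here is a minimal sketch of k-NN classification. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy points and labels are hypothetical, not taken from the slides.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbours wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(points, labels, (2.0, 2.0), k=3)
print("predicted label:", prediction)
```

The prediction depends entirely on which objects count as "nearest", which is why the choice of distance measure matters.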
Measuring Data Similarity and Dissimilarity (cont.)
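• Measuring dissimilarity = measuring distance?
• Two objects that are very close are very similar, so the distance between them is very small.
• For objects $i = (x_{i1}, \dots, x_{ip})$ and $j = (x_{j1}, \dots, x_{jp})$ with p numeric features:
• Euclidean: $d(i,j) = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2}$
• Manhattan: $d(i,j) = \sum_{f=1}^{p} |x_{if} - x_{jf}|$
• Minkowski: $d(i,j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h}$, with $h \geq 1$
• Chebyshev (the limit of Minkowski as $h \to \infty$): $d(i,j) = \max_{f} |x_{if} - x_{jf}|$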
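The four distance measures named above can be sketched in plain Python. Minkowski distance with h = 1 gives Manhattan and with h = 2 gives Euclidean; Chebyshev is its limit as h grows. The two example points are hypothetical.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1) between equal-length vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def manhattan(x, y):
    return minkowski(x, y, 1)

def euclidean(x, y):
    return minkowski(x, y, 2)

def chebyshev(x, y):
    """Largest absolute difference over all features (Minkowski as h -> infinity)."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical example points with two numeric features.
x1, x2 = (1, 2), (3, 5)

print("Manhattan:", manhattan(x1, x2))
print("Euclidean:", euclidean(x1, x2))
print("Chebyshev:", chebyshev(x1, x2))
print("Minkowski, h=50:", minkowski(x1, x2, 50))  # already close to Chebyshev
```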


Measuring Data Similarity and Dissimilarity (cont.)
• Data matrix (object-by-attribute structure):
• The data set has n objects and p features, stored as an n × p matrix.
• Dissimilarity matrix (object-by-object structure):
• A symmetric n × n matrix with values >= 0.
• d(i, j) = d(j, i) = dissimilarity between object i and object j.
• Of course d(i, i) = 0.
• Similarity of objects
• sim(i, j) = 1 - d(i, j) (when d is scaled to the range [0, 1])
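As an illustration, a dissimilarity matrix can be built from a data matrix as sketched below. The sketch assumes numeric features and Euclidean distance; the function name and the four sample objects are hypothetical.

```python
import math

def dissimilarity_matrix(data):
    """Build the n x n matrix of pairwise Euclidean distances for n objects."""
    n = len(data)
    matrix = [[0.0] * n for _ in range(n)]  # d(i, i) = 0 on the diagonal
    for i in range(n):
        for j in range(i):
            d = math.dist(data[i], data[j])
            matrix[i][j] = matrix[j][i] = d  # symmetry: d(i, j) = d(j, i)
    return matrix

# Hypothetical data matrix: 4 objects (rows) described by 2 features (columns).
data = [(1.0, 2.0), (3.0, 5.0), (2.0, 0.0), (4.0, 5.0)]
D = dissimilarity_matrix(data)

for row in D:
    print([round(value, 2) for value in row])
```

Only the lower triangle needs to be computed, because the matrix is symmetric.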
Try a simple example
• Only use 2 features: Age and Height.
• Units of measurement: cm and kg (or m and kg)
• Calculate d(#1,#2), d(#1,#3), d(#2,#3), …
• Opinions?
Measuring Data Similarity and Dissimilarity (cont.)
• Problems:
• Numeric attributes: easy!
• Nominal, binary, and ordinal attributes: how?
• Dissimilarity with nominal features
• The data set has p nominal features.
• The dissimilarity (distance) between i and j is $d(i,j) = \frac{p - m}{p}$
• m: the number of features on which i and j have the same value.
• Similarity with nominal features: $sim(i,j) = 1 - d(i,j) = \frac{m}{p}$
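A minimal sketch of the nominal case follows, using the simple matching ratio d(i, j) = (p - m)/p, where p is the number of nominal features and m the number of features on which the two objects agree. The function names and the two sample objects are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by p nominal features."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matching features
    return (p - m) / p

def nominal_similarity(obj_i, obj_j):
    """sim(i, j) = 1 - d(i, j) = m / p."""
    return 1 - nominal_dissimilarity(obj_i, obj_j)

# Hypothetical objects: (hair color, marital status, occupation).
person_1 = ("black", "single", "teacher")
person_2 = ("black", "married", "teacher")

print("d:", nominal_dissimilarity(person_1, person_2))
print("sim:", nominal_similarity(person_1, person_2))
```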


Measuring Data Similarity and Dissimilarity (cont.)
The data set has 4 objects.
Only the test-1 feature is used.
Measuring Data Similarity and Dissimilarity (cont.)
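• Dissimilarity (distance) with binary features.
• Note the 2 cases: symmetric and asymmetric.
• Suppose the data set has p binary features. For two objects i and j:
• q: number of features with value 1 for both objects.
• t: number of features with value 0 for both objects.
• r + s: number of features with different values (r: 1 for i and 0 for j; s: 0 for i and 1 for j).
• Symmetric: $d(i,j) = \frac{r + s}{q + r + s + t}$
• Asymmetric: $d(i,j) = \frac{r + s}{q + r + s}$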
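The two binary cases can be sketched as follows. The counts are q (both objects have 1), r (1 for i, 0 for j), s (0 for i, 1 for j) and t (both have 0). The symmetric measure is (r + s)/(q + r + s + t); the asymmetric one drops t, giving (r + s)/(q + r + s). The function names and the three 0/1 vectors are hypothetical, and returning 0.0 when the asymmetric denominator is zero is an assumption made only for this sketch.

```python
def binary_counts(obj_i, obj_j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    pairs = list(zip(obj_i, obj_j))
    q = sum(1 for a, b in pairs if a == 1 and b == 1)
    r = sum(1 for a, b in pairs if a == 1 and b == 0)
    s = sum(1 for a, b in pairs if a == 0 and b == 1)
    t = sum(1 for a, b in pairs if a == 0 and b == 0)
    return q, r, s, t

def symmetric_binary_dissimilarity(obj_i, obj_j):
    q, r, s, t = binary_counts(obj_i, obj_j)
    return (r + s) / (q + r + s + t)

def asymmetric_binary_dissimilarity(obj_i, obj_j):
    q, r, s, _ = binary_counts(obj_i, obj_j)  # 0-0 matches (t) are ignored
    total = q + r + s
    return (r + s) / total if total else 0.0  # assumption: no 1s at all -> 0.0

# Hypothetical objects with six asymmetric binary features (1 = positive).
a = (1, 0, 1, 0, 0, 0)
b = (1, 0, 1, 0, 1, 0)
c = (1, 1, 0, 0, 0, 0)

print("d(a, b):", round(asymmetric_binary_dissimilarity(a, b), 2))
print("d(a, c):", round(asymmetric_binary_dissimilarity(a, c), 2))
print("d(b, c):", round(asymmetric_binary_dissimilarity(b, c), 2))
```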
Measuring Data Similarity and Dissimilarity (cont.)

Name: ID
Gender: symmetric.
Fever, Cough, Test-1, Test-2, Test-3, Test-4: asymmetric.
Yes/P: 1 (important)
No/N: 0
Try a simple example
• Only use Gender, TestA and Blood.
• Calculate d(#1,#2), d(#1,#3), d(#2,#3), …
• Opinions?
Measuring Data Similarity and Dissimilarity (cont.)
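• Measuring dissimilarity with ordinal features.
• Sort the values in ascending order and replace each value by its rank.
• Normalize the ranks to the range [0, 1]: $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, where $r_{if}$ is the rank and $M_f$ is the number of ordered states.
• Example: only use Obj ID and test-2
• Values in test-2: fair, good, excellent
• Ranking: 1 (fair), 2 (good), 3 (excellent)
• Normalized: 1 → 0, 2 → 0.5, and 3 → 1
• Then use the Euclidean distance formula on the normalized values.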
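The ordinal procedure can be sketched as below: map each state to its rank, normalize the rank r to z = (r - 1)/(M - 1) where M is the number of states, then measure distance on z. The per-object test-2 values used here are hypothetical, since the slide's table is not reproduced in the text.

```python
def normalize_ranks(values, ordered_states):
    """Map ordinal values to [0, 1] using z = (rank - 1) / (M - 1)."""
    rank = {state: position + 1 for position, state in enumerate(ordered_states)}
    m = len(ordered_states)
    return [(rank[value] - 1) / (m - 1) for value in values]

states = ["fair", "good", "excellent"]            # ascending order, ranks 1, 2, 3
test_2 = ["excellent", "fair", "good", "excellent"]  # hypothetical values for 4 objects

z = normalize_ranks(test_2, states)

def ordinal_distance(i, j):
    """Euclidean distance on one normalized feature is the absolute difference."""
    return abs(z[i] - z[j])

print("normalized:", z)
print("d(#1, #2):", ordinal_distance(0, 1))
print("d(#1, #3):", ordinal_distance(0, 2))
```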
Try a simple example
• Only use: Heart beat, Blood Pressure
• Calculate d(#1,#2), d(#1,#3), d(#2,#3), …
• Opinions?
Measuring Data Similarity and Dissimilarity (cont.)
• General:
• A data set has many features.
• The features have many different data types.
• Example:
• Height, Weight: numeric.
• Gender: symmetric binary.
• Symptom 1, Symptom 2, Symptom 3: asymmetric binary
• Religion, Ethnicity: nominal.
• What is the dissimilarity between objects i and j?
• Solution 1: calculate the dissimilarity for each feature, then combine them with a distance formula in the last step.
Try a simple example
• Use all features.
• Calculate d(An,Bích), d(An,Cúc), …
• Opinions?
Measuring Data Similarity and Dissimilarity (cont.)
• General case
Measuring Data Similarity and Dissimilarity (cont.)
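• General case (cont.)
• Solution 2: normalize all features to the range [0, 1], then calculate the dissimilarity over all of them.
• The data set has p features of different data types.
• Formula to measure the dissimilarity between i and j: $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
• Dissimilarity $d_{ij}^{(f)}$ by feature type:
• Numeric: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$
• Nominal/Binary: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise 1
• Ordinal: compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, then treat $z_{if}$ as numeric
• Cases for the indicator $\delta_{ij}^{(f)}$:
• 0 if $x_{if}$ or $x_{jf}$ is missing, or if $x_{if} = x_{jf} = 0$ and f is asymmetric binary
• 1 in all other cases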
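The mixed-type case can be sketched as a weighted average of per-feature dissimilarities, where a feature is skipped (indicator 0) if a value is missing or if both values are 0 on an asymmetric binary feature. This is only an illustration under assumptions: missing values are encoded as `None`, ordinal features are assumed to be already normalized to [0, 1] and passed as numeric with range 1, and the feature kinds, ranges and sample objects are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds, ranges):
    """Average the per-feature dissimilarities over the features that count.

    kinds[f]  is "numeric", "nominal", "symmetric" or "asymmetric".
    ranges[f] is max - min of feature f (used for numeric features only).
    """
    numerator = 0.0
    counted = 0
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:
            continue  # indicator 0: a value is missing
        if kind == "asymmetric" and a == 0 and b == 0:
            continue  # indicator 0: 0-0 match on an asymmetric binary feature
        if kind == "numeric":
            d_f = abs(a - b) / ranges[f]   # scaled to [0, 1] by the feature range
        else:
            d_f = 0.0 if a == b else 1.0   # nominal and binary: match or mismatch
        numerator += d_f
        counted += 1
    # Assumption for this sketch: if no feature counts, report 0.0.
    return numerator / counted if counted else 0.0

# Hypothetical features: height (numeric), gender (symmetric binary),
# symptom (asymmetric binary), ethnic group (nominal).
kinds = ["numeric", "symmetric", "asymmetric", "nominal"]
ranges = [40.0, None, None, None]

obj_a = (170.0, "M", 1, "red")
obj_b = (150.0, "F", 0, "red")
obj_c = (None, "M", 0, "blue")

print("d(a, b):", mixed_dissimilarity(obj_a, obj_b, kinds, ranges))
print("d(b, c):", mixed_dissimilarity(obj_b, obj_c, kinds, ranges))
```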
Measuring Data Similarity and Dissimilarity (cont.)
• Cosine Similarity
• Documents are described by statistics on keywords.
• Counting how often each keyword appears in a document gives its term-frequency vector.
• Term-frequency vectors are high-dimensional, sparse, and asymmetric.
• Cosine similarity is fast and easy to compute: $sim(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
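A minimal sketch of cosine similarity between two term-frequency vectors: the dot product divided by the product of the vector lengths. The two vectors below are hypothetical keyword counts for two documents.

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors: one count per keyword, mostly zeros.
doc_1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
doc_2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(doc_1, doc_2)
print("cosine similarity:", round(similarity, 2))
```

Keywords that are absent from both documents contribute nothing to the dot product, which is why the measure suits sparse, asymmetric data.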
Measuring Data Similarity and Dissimilarity (cont.)

• Cosine Similarity (cont.)


Homework
• Try some free data mining software (Weka, Orange, KNIME, ...)
• Load the sample data sets that come with the software.
• Observe, check, and calculate the mean, median, mode, Q1, Q2, Q3, variance, standard deviation, ... of a sample data set.
• Try the functions and menus of the software to observe and analyze charts, histograms, correlations between features, ...
• How do you import/export a data set?
WEKA free software
Next session
• Pre-Processing data
• Why pre-process data?
• Why is pre-processing so important?
• What does pre-processing do?
• Pre-processing problems:
• Data integration.
• Data cleaning.
• Missing values, noisy data.
• Discretization.
• Data extraction.
• Feature selection, feature extraction.
• ….
