Attribute-Oriented Analysis
Presenter: Joveliza A. Trongcoso
Topic Outline
• Attribute generalization
• Attribute relevance
• Class comparison
• Statistical measures
• Experiments with Weka – using filters and statistics
Data Objects
• Represents an entity
• Example: in a sales database, the objects may be customers, store
items, and sales.
• Data objects are typically described by attributes.
• If the data objects are stored in a database, they are data tuples.
• That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
What is an Attribute?
• A data field representing a characteristic or feature of a data object.
• The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
• Attributes describing a customer object can include, for example,
customer_ID, name, and address
What is an Attribute? (cont’d)
• Observations – the observed values for a given attribute
• Attribute vector (or feature vector) – the set of attributes used to describe
a given object
• Univariate – a distribution of data involving one attribute (or variable)
• Bivariate – a distribution of data involving two attributes
Types of Attributes
1. Nominal
2. Binary
3. Ordinal
4. Numeric
Nominal Attributes
• The values of a nominal attribute are symbols or names of things.
• Nominal attributes are also referred to as categorical.
Example: hair_color and marital_status
Binary Attributes
• A nominal attribute with only two categories or states: 0 or 1
• 0 means absent; 1 means present
• Binary attributes are referred to as Boolean if the two states correspond
to true and false.
Example:
Attribute smoker describing a patient object
1 indicates that the patient smokes, while 0 indicates that the patient does not
Binary Attributes (cont’d)
• A binary attribute is symmetric if both of its states are equally
valuable and carry the same weight.
• A binary attribute is asymmetric if the outcomes of the states are not
equally important.
Ordinal Attributes
• An attribute with possible values that have a meaningful order or
ranking among them.
Example:
drink_size (small, medium, and large)
professional_rank (private, private first class, specialist, corporal, and sergeant)
Ordinal Attributes (cont’d)
• Ordinal attributes are often used in surveys for ratings.
Example: customer satisfaction with the following ordinal categories:
0: very dissatisfied,
1: somewhat dissatisfied,
2: neutral,
3: satisfied, and
4: very satisfied.
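The ordering of ordinal values can be made explicit in code. The following is a minimal Python sketch that uses the rating codes listed above; the list of responses is made-up illustration data, not taken from the slides. Because only the order of the categories is meaningful, the median is used as the summary here, since a mean would assume equal spacing between categories.

```python
from statistics import median_low

# Ordinal categories from the slide, keyed by their rank codes.
SATISFACTION = {
    0: "very dissatisfied",
    1: "somewhat dissatisfied",
    2: "neutral",
    3: "satisfied",
    4: "very satisfied",
}

# Hypothetical survey responses (illustration only).
responses = [3, 4, 2, 3, 1, 4, 3]

# median_low always returns an actual data value, so the result
# maps back to one of the categories.
middle = median_low(responses)
print(SATISFACTION[middle])
```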
Numeric Attributes
• A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.
• Numeric attributes can be interval-scaled or ratio-scaled.
Numeric Attributes (cont’d)
Interval-Scaled Attributes
• measured on a scale of equal-size units
• values have order
• allow us to compare and quantify the difference between values
Example:
• Temperatures of 20°C and 15°C differ by 5 degrees
• Calendar years 2010 and 2022 are 12 years apart
Numeric Attributes (cont’d)
Ratio-Scaled Attributes
• a numeric attribute with an inherent zero-point
• values are ordered, and we can also compute the difference between values,
as well as the mean, median, and mode
Example:
• count attributes such as years_of_experience (e.g., the objects are employees)
• number_of_words (e.g., the objects are documents)
• Additional examples include attributes to measure weight, height, latitude
and longitude coordinates (e.g., when clustering houses)
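One way to see the difference between interval-scaled and ratio-scaled attributes is to change the unit of measurement and check whether a ratio of two values survives. The sketch below is only an illustration: the helper names and the weights of 80 and 40 kg are assumptions, not part of the slides.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32

def kg_to_lb(kg):
    """Convert kilograms to pounds (1 kg is about 2.20462 lb)."""
    return kg * 2.20462

# Interval-scaled: the ratio changes with the unit, so "20°C is 1.33
# times as warm as 15°C" is not a meaningful statement.
ratio_c = 20 / 15
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(15)

# Ratio-scaled: weight has a true zero, so the ratio survives a unit change.
ratio_kg = 80 / 40
ratio_lb = kg_to_lb(80) / kg_to_lb(40)

print(round(ratio_c, 3), round(ratio_f, 3))
print(ratio_kg, round(ratio_lb, 3))
```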
Attribute Generalization
• Attribute generalization is based on the following rule: “if there is a
large set of distinct values for an attribute, then a generalization
operator should be selected and applied to the attribute”
• Nominal attributes: the operation defines a sub-cube by performing a
selection on two or more dimensions. (Dropping condition)
• Structured attributes: climbing up a concept hierarchy is used, replacing a
value in an attribute-value pair with a more general one. The operation
performs aggregation on a data cube, either by climbing up a concept hierarchy
for a dimension or by dimension reduction.
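The generalization rule and the concept-hierarchy climb can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the city-to-country hierarchy, the threshold value, and the function name `generalize` do not come from the slides.

```python
# Hypothetical one-level concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}

def generalize(values, hierarchy, threshold):
    """Climb the hierarchy once if the attribute has too many distinct values."""
    if len(set(values)) <= threshold:
        return list(values)  # already general enough
    # Replace each value with its parent; keep values with no parent as-is.
    return [hierarchy.get(v, v) for v in values]

cities = ["Vancouver", "Toronto", "Chicago", "New York", "Toronto"]
generalized = generalize(cities, CITY_TO_COUNTRY, threshold=2)
print(generalized)
```

After one climb the attribute has two distinct values instead of four, so identical generalized tuples could then be merged and counted.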
Attribute Generalization (cont’d)
Example:
Set representation
Generalization
Y1 = {x2 = hot, x3 = high, x4 = weak} (X1 with first and last attributes dropped)
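The dropping-condition step that produces Y1 can be sketched as a dictionary filter. Only the values of x2, x3, and x4 are given above; the values used below for x1 and x5 are placeholders assumed purely so that the example runs, and the helper name `drop_attributes` is hypothetical.

```python
# X1 as attribute-value pairs. Only x2..x4 appear on the slide;
# the values for x1 and x5 are placeholders assumed for illustration.
X1 = {"x1": "sunny", "x2": "hot", "x3": "high", "x4": "weak", "x5": "no"}

def drop_attributes(obj, to_drop):
    """Generalize a description by dropping the named attributes."""
    return {k: v for k, v in obj.items() if k not in to_drop}

Y1 = drop_attributes(X1, {"x1", "x5"})
print(Y1)

# Y1 is more general: every object matching X1 also matches Y1.
assert all(X1[k] == v for k, v in Y1.items())
```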
Attribute Relevance
Attribute relevance analysis is done in order to filter out statistically
irrelevant or weakly relevant attributes, and retain or even rank the
most relevant attributes for the descriptive mining task at hand.
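The slides do not name a specific relevance measure. Information gain is one commonly used choice, and the sketch below assumes it in order to rank two attributes against a class label. The toy rows are hypothetical and were built so that one attribute separates the classes perfectly and the other not at all.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical toy data: "major" separates the classes, "gender" does not.
rows = [
    {"major": "CS", "gender": "F", "status": "graduate"},
    {"major": "CS", "gender": "M", "status": "graduate"},
    {"major": "Arts", "gender": "F", "status": "undergraduate"},
    {"major": "Arts", "gender": "M", "status": "undergraduate"},
]

# Rank attributes from most to least relevant.
ranked = sorted(["major", "gender"],
                key=lambda a: information_gain(rows, a, "status"),
                reverse=True)
print(ranked)
```

An attribute whose gain falls below a chosen threshold would be a candidate for removal before the descriptive mining step.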
Class Comparison
• Class discrimination or comparison (hereafter referred to as class
comparison) mines descriptions that distinguish a target class from its
contrasting classes.
• The target and contrasting classes must be comparable, that is, they must
share similar dimensions and attributes.
Class Comparison (cont’d)
Example: a class comparison describing the graduate and
undergraduate students at Big University.
Mining a class comparison. Suppose that you would like to compare the
general properties of the graduate and undergraduate students at Big
University, given the attributes name, gender, major, birth_place,
birth_date, residence, phone#, and gpa.
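A rough Python sketch of the idea follows: generalize the tuples of each class to the same level, then compare how often each generalized tuple occurs in the target class and in the contrasting class. The student records, the gpa cut point, and the function names are all assumptions made for illustration; they are not the Big University data.

```python
from collections import Counter

# Hypothetical student tuples: (status, major, gpa).
students = [
    ("graduate", "Science", 3.8),
    ("graduate", "Science", 3.6),
    ("graduate", "Business", 3.2),
    ("undergraduate", "Science", 3.1),
    ("undergraduate", "Business", 2.7),
    ("undergraduate", "Business", 3.7),
]

def gpa_level(gpa):
    """Generalize a numeric gpa to a coarse level (cut point is an assumption)."""
    return "excellent" if gpa >= 3.5 else "good"

def class_description(rows, status):
    """Return count% of each generalized (major, gpa_level) tuple within one class."""
    members = [(major, gpa_level(gpa)) for s, major, gpa in rows if s == status]
    counts = Counter(members)
    return {key: 100 * n / len(members) for key, n in counts.items()}

target = class_description(students, "graduate")
contrast = class_description(students, "undergraduate")

# Compare the two classes on the same generalized tuples.
for key in sorted(set(target) | set(contrast)):
    print(key, round(target.get(key, 0), 1), round(contrast.get(key, 0), 1))
```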
Class Comparison (cont’d)
This data mining task can be expressed in DMQL as follows:
Statistical Description of Data
What
1. Measures of central tendency
• Mean, median, mode
• Location of the center of a data distribution
• Where do most of the attribute values fall?
2. Dispersion measures
• Range, quartiles, interquartile range, five-number summary and box plots,
variance and standard deviation
• Describe how the data are spread out
Why
• To get an overall picture of the data, basic statistical descriptions are
used in data analysis
• The statistical metrics can tell us whether there are issues such as extreme
outliers and large deviations in the values of attributes
What is an Outlier?
• A data value that differs significantly from the other values
• It affects the mean of the data but has little effect on the median or mode
Measures of Central Tendency
Example: We have the values for salary (in thousands of dollars): 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Mean – Average value of a numeric attribute
• Mean salary is $58,000.
Median – Middle value of a numeric attribute
• Sort values in increasing order.
• If N is odd, the median is the middle value of the ordered set.
• If N is even, the median is the average of the two middlemost values.
• Median is $54,000.
Mode – Most common value of an attribute
• It can be determined for qualitative and quantitative attributes.
• The data from the example are bimodal.
• Modes are $52,000 and $70,000.
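The three measures can be checked against the salary example with Python's standard `statistics` module. This is a small sketch, not part of the original slides. The last lines drop the extreme value 110 to illustrate how an outlier pulls the mean more than the median.

```python
from statistics import mean, median, multimode

# Salary values (in thousands of dollars) from the example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))       # 58, i.e. $58,000
print(median(salaries))     # average of the two middle values, 52 and 56
print(multimode(salaries))  # every value tied for most frequent

# Effect of the extreme value 110: the mean shifts more than the median.
trimmed = salaries[:-1]
print(round(mean(trimmed), 2), median(trimmed))
```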
Dispersion Measures
Example: Suppose the data for an attribute X are sorted in increasing numeric order.
Range – The difference between the largest and smallest values of the attribute.
Quantiles – points taken at regular intervals of the data distribution, dividing the data into essentially equal-size consecutive sets.
2-Quantile – a data point dividing the lower and upper halves of the data – Median
4-Quantiles – three data points that divide the data into four equal parts - Quartiles
100-Quantiles – divide the data values into 100 parts – Percentiles.
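As an illustration, the range and the q-quantiles can be computed with Python's standard `statistics.quantiles` function. The sketch reuses the salary values from the central-tendency example as the attribute X, which is an assumption made here for convenience. Several conventions exist for interpolating quantiles, so other tools may report slightly different cut points for the same data.

```python
from statistics import quantiles

# Salary values (in thousands of dollars), already sorted.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Range: largest minus smallest value.
value_range = max(salaries) - min(salaries)

# q-quantiles: the q - 1 cut points that split the sorted data into q parts.
median_cut = quantiles(salaries, n=2)         # 2-quantile (median)
quartile_cuts = quantiles(salaries, n=4)      # 4-quantiles (quartiles)
percentile_cuts = quantiles(salaries, n=100)  # 100-quantiles (percentiles)

print(value_range)
print(median_cut, quartile_cuts, len(percentile_cuts))
```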
Dispersion Measures (Quartile)
A plot of the data distribution for an attribute X.
First quartile Q1 – 25th percentile
• Cuts off the lowest 25% of the data.
Second quartile Q2 – 50th percentile
• Median gives the center of the data distribution.
Third quartile Q3 – 75th percentile
• Cuts off the lowest 75% (or highest 25%) of the data.
The distance between Q1 and Q3 gives the range covered by the middle half of
the data. This distance is called the interquartile range: IQR = Q3 - Q1
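The quartiles and the IQR can be computed for the salary example in the same way. The sketch below also applies a common rule of thumb that flags values lying more than 1.5 × IQR beyond Q1 or Q3 as suspected outliers. That rule is an assumption added here for illustration and is not stated on the slides. The exact quartile values depend on the interpolation convention used by `statistics.quantiles`.

```python
from statistics import quantiles

# Salary values (in thousands of dollars), already sorted.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

q1, q2, q3 = quantiles(salaries, n=4)
iqr = q3 - q1

# Rule of thumb (an assumption here, not stated on the slides):
# flag values more than 1.5 * IQR beyond Q1 or Q3.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in salaries if x < low or x > high]

print(q1, q2, q3, iqr)
print(outliers)
```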
Experiments with Weka
using Filters