Unit I Preprocessing
DATA PREPROCESSING
Data preprocessing: Data cleaning, Data transformation, Data reduction, Discretization and
generating concept hierarchies. Attribute-oriented analysis: Attribute generalization, Attribute
relevance, Class comparison, Statistical measures.
Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format.
Data Cleaning
Data cleaning is the process of removing incorrect data, incomplete data, and inaccurate
data from the datasets, and it also replaces the missing values. Here are some techniques
for data cleaning:
Handling Missing Values
Standard values like “Not Available” or “NA” can be used to replace the missing
values.
Missing values can also be filled in manually, but this is not recommended when the dataset is big.
The attribute’s mean value can be used to replace the missing value when the data is normally distributed, whereas in the case of a non-normal distribution the median value of the attribute can be used (a short sketch of this is given below).
While using regression or decision tree algorithms, the missing value can be
replaced by the most probable value.
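A minimal sketch of mean and median replacement, assuming pandas and a made-up “income” column (the column name and values are only illustrative):
import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [4000, 5200, np.nan, 6100, np.nan, 4800]})

# Mean replacement suits roughly normal data; median suits skewed data.
mean_filled = df["income"].fillna(df["income"].mean())
median_filled = df["income"].fillna(df["income"].median())

print(mean_filled.tolist())
print(median_filled.tolist())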
Handling Noisy Data
Noisy data generally means data containing random errors or unnecessary data points. Handling noisy data is one of the most important steps, as it leads to better optimization of the model we are using. Here are some of the methods to handle noisy data.
Binning: This method is used to smooth or handle noisy data. First the data is sorted, and then the sorted values are separated and stored in the form of bins. There are three methods for smoothing the data in a bin. Smoothing by bin means: the values in the bin are replaced by the mean value of the bin. Smoothing by bin median: the values in the bin are replaced by the median value of the bin. Smoothing by bin boundaries: the minimum and maximum values of the bin are taken as the bin boundaries, and each value is replaced by the closest boundary value.
Steps to follow:
1. Sort the given data set.
2. Divide the data into N intervals, each containing approximately the same number of samples (equal-depth partitioning).
3. Replace the values in each bin by the bin mean/median/boundaries.
Example:
Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
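The following sketch completes the example, assuming three equal-depth bins of four values each; it smooths the sorted data above by bin means and by bin boundaries using NumPy:
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = data.reshape(3, 4)          # equal-depth partitioning: 3 bins of 4 values each

# Smoothing by bin means: every value becomes the mean of its bin.
by_means = np.repeat(bins.mean(axis=1), 4).reshape(3, 4)

# Smoothing by bin boundaries: every value snaps to the nearer of its bin's min/max.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo < hi - bins, lo, hi)

print(bins)        # [[ 4  8  9 15] [21 21 24 25] [26 28 29 34]]
print(by_means)    # bin means 9.0, 22.75, 29.25 repeated within each bin
print(by_bounds)   # e.g. first bin becomes [4, 4, 4, 15]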
Regression: This is used to smooth the data and helps handle data when unnecessary variation is present. For analysis purposes, regression helps decide which variable is suitable for our analysis.
Linear regression is an algorithm that provides a linear relationship between
an independent variable and a dependent variable to predict the outcome of future
events. It is a statistical method used in data science and machine learning for
predictive analysis.
The independent variable is also called the predictor or explanatory variable; it remains unchanged despite changes in other variables. However, the dependent
variable changes with fluctuations in the independent variable. The regression
model predicts the value of the dependent variable, which is the response or
outcome variable being analyzed or studied.
Thus, linear regression is a supervised learning algorithm that simulates a
mathematical relationship between variables and makes predictions for
continuous or numeric variables such as sales, salary, age, product price, etc.
This analysis method is advantageous when at least two variables are available in
the data, as observed in stock market forecasting, portfolio management,
scientific analysis, etc.
A sloped straight line represents the linear regression model.
Here, a line is plotted for the given data points so that it suitably fits all of them. Hence, it is called the ‘best fit line.’ The goal of the linear regression algorithm is to find this best fit line.
Key benefits of linear regression
Linear regression is a popular statistical tool used in data science, thanks to the several benefits it
offers, such as:
1. Easy implementation
The linear regression model is computationally simple to implement, as it does not demand a lot of engineering overhead either before the model launch or during its maintenance.
2. Interpretability
Unlike deep learning models such as neural networks, linear regression is relatively straightforward. As a result, this algorithm stands ahead of black-box models that fall short in justifying which input variable causes the output variable to change.
3. Scalability
Linear regression is not computationally heavy and, therefore, fits well in cases where
scaling is essential. For example, the model can scale well regarding increased data
volume (big data).
4. Optimal for online settings
The ease of computation of these algorithms allows them to be used in online settings.
The model can be trained and retrained with each new example to generate
predictions in real-time, unlike the neural networks or support vector machines that
are computationally heavy and require plenty of computing resources and
substantial
waiting time to retrain on a new dataset. All these factors make such compute-
intensive models expensive and unsuitable for real-time applications.
The above features highlight why linear regression is a popular model to solve real-
life machine learning problems.
Linear Regression Equation
Mathematically linear regression is represented by the equation,
Y = m*X + b
Where
X = independent variable (predictor)
Y = dependent variable (target)
m = slope of the line (slope is defined as the ‘rise’ over the ‘run’)
b = y-intercept of the line
Example:
Let’s consider a dataset that covers RAM sizes and their corresponding costs.
In this case, the dataset comprises two distinct features: memory (capacity) and cost.
The more the RAM capacity, the higher the purchase cost.
Dataset: RAM Capacity vs. Cost
X (RAM Capacity)    Y (Cost)
2                   12
4                   16
8                   28
16                  62
Plotting RAM capacity on the X-axis and its cost on the Y-axis, a line rising from the lower-left corner of the graph to the upper right represents the relationship between X and Y. Plotting these data points on a scatter plot shows this upward-sloping trend.
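A minimal sketch of fitting the line Y = m*X + b to the RAM dataset above with NumPy's least-squares polyfit (the 32 GB prediction is only illustrative):
import numpy as np

X = np.array([2, 4, 8, 16], dtype=float)     # RAM capacity (GB)
Y = np.array([12, 16, 28, 62], dtype=float)  # cost

m, b = np.polyfit(X, Y, 1)                   # least-squares fit of Y = m*X + b
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
print("predicted cost for 32 GB:", m * 32 + b)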
Clustering: This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.
The task of grouping data points based on their similarity with each other is called Clustering or Cluster Analysis. This method falls under the branch of Unsupervised Learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, no target variable is defined.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity based on a metric like Euclidean distance, Cosine similarity, Manhattan distance, etc., and then groups the points with the highest similarity scores together.
For example, in a plot of the data we may clearly see three circular clusters forming on the basis of distance.
It is not necessary that the clusters formed must be circular in shape; the shape of clusters can be arbitrary. There are many algorithms that work well at detecting arbitrarily shaped, non-circular clusters.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group
similar data points:
Hard Clustering:
In this type of clustering, each data point either belongs to a cluster completely or not. For example, let’s say there are 4 data points and we have to cluster them into 2 clusters. So each data point will belong either to cluster 1 or to cluster 2.
Data Points    Cluster
A              C1
B              C2
C              C2
D              C1
Soft Clustering:
In this type of clustering, instead of assigning each data point to a single cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, let’s say there are 4 data points and we have to cluster them into 2 clusters. So we will evaluate the probability of each data point belonging to both clusters. This probability is calculated for all data points (see the sketch after the table below).
Data Points    Probability of C1    Probability of C2
A              0.91                 0.09
B              0.3                  0.7
C              0.17                 0.83
D              1                    0
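A minimal sketch contrasting hard and soft assignments, using four made-up 2-D points labelled A–D (the coordinates are not from the text); KMeans gives hard labels, while a Gaussian mixture gives per-cluster probabilities:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

points = np.array([[1.0, 1.0],   # A
                   [8.0, 8.0],   # B
                   [7.5, 8.5],   # C
                   [1.2, 0.8]])  # D

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
soft = GaussianMixture(n_components=2, random_state=0).fit(points).predict_proba(points)

print("hard labels:", hard)                   # each point gets exactly one cluster
print("soft memberships:\n", soft.round(2))   # probability of each cluster per point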
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through the
use cases of Clustering algorithms. Clustering algorithms are majorly used for:
Market Segmentation – Businesses use clustering to group their customers
and use targeted advertisements to attract more audience.
Market Basket Analysis – Shop owners analyze their sales and figure out which items are mostly bought together by the customers. For example, in the USA, according to a study, diapers and beer were usually bought together by fathers.
Social Network Analysis – Social media sites use your data to understand
your browsing behaviour and provide you with targeted friend
recommendations or content recommendations.
Medical Imaging – Doctors use Clustering to find out diseased areas in
diagnostic images like X-rays.
Anomaly Detection – To find outliers in a stream of real-time dataset or
forecasting fraudulent transactions we can use clustering to identify them.
Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete. Now, you may reduce an entire feature set to its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID. Using the same principle, clustering can make complex datasets simpler.
4. Distribution-based Clustering
In distribution-based clustering, data points are grouped according to their propensity to fall into the same probability distribution (such as a Gaussian or binomial distribution) within the data. The data elements are grouped using a probability-based distribution that is based on statistical distributions. Data objects that have a higher likelihood of belonging to the distribution are included in the cluster, and a data point is less likely to be included in a cluster the further it is from the cluster’s central point, which exists in every cluster.
A notable drawback of density and boundary-based approaches is the need
to specify the clusters a priori for some algorithms, and primarily the definition of
the cluster form for the bulk of algorithms. There must be at least one tuning or
hyper-parameter selected, and while doing so should be simple, getting it wrong
could have unanticipated repercussions. Distribution-based clustering has a
definite advantage over proximity and centroid-based clustering approaches in
terms of flexibility, accuracy, and cluster structure. The key issue is that, in order
to avoid overfitting, many clustering methods only work with simulated or
manufactured data, or when the bulk of the data points certainly belong to a preset
distribution. The most popular distribution-based clustering algorithm is Gaussian
Mixture Model.
Applications of Clustering in different fields:
Marketing: It can be used to characterize & discover customer segments
for marketing purposes.
Biology: It can be used for classification among different species of
plants and animals.
Libraries: It is used in clustering different books on the basis of topics
and information.
Insurance: It is used to acknowledge the customers, their policies and
identifying the frauds.
City Planning: It is used to make groups of houses and to study their
values based on their geographical locations and other factors present.
Earthquake studies: By learning the earthquake-affected areas we can
determine the dangerous zones.
Image Processing: Clustering can be used to group similar images
together, classify images based on content, and identify patterns in image
data.
Genetics: Clustering is used to group genes that have similar expression
patterns and identify gene networks that work together in biological
processes.
Finance: Clustering is used to identify market segments based on
customer behavior, identify patterns in stock market data, and analyze
risk in investment portfolios.
Customer Service: Clustering is used to group customer inquiries and
complaints into categories, identify common issues, and develop targeted
solutions.
Manufacturing: Clustering is used to group similar products together,
optimize production processes, and identify defects in manufacturing
processes.
Medical diagnosis: Clustering is used to group patients with similar
symptoms or diseases, which helps in making accurate diagnoses and
identifying effective treatments.
Fraud detection: Clustering is used to identify suspicious patterns or
anomalies in financial transactions, which can help in detecting fraud or
other financial crimes.
Traffic analysis: Clustering is used to group similar patterns of traffic
data, such as peak hours, routes, and speeds, which can help in improving
transportation planning and infrastructure.
Social network analysis: Clustering is used to identify communities or
groups within social networks, which can help in understanding social
behavior, influence, and trends.
Cybersecurity: Clustering is used to group similar patterns of network
traffic or system behavior, which can help in detecting and preventing
cyberattacks.
Climate analysis: Clustering is used to group similar patterns of climate
data, such as temperature, precipitation, and wind, which can help in
understanding climate change and its impact on the environment.
Sports analysis: Clustering is used to group similar patterns of player or
team performance data, which can help in analyzing player or team
strengths and weaknesses and making strategic decisions.
Crime analysis: Clustering is used to group similar patterns of crime
data, such as location, time, and type, which can help in identifying crime
hotspots, predicting future crime trends, and improving crime prevention
strategies.
Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. The data integration process is one of the main components of data management. There are some problems to be considered during data integration.
1. Schema integration:
Integrates metadata (a set of data that describes other data) from different sources.
Definition: Schema integration is used to merge two or more database schemas into a
single schema that can store data from both the original databases. For large databases
with many expected users and applications, the integration approach of designing individual schemas and then merging them can be used, because the individual views can be kept relatively small and simple. Schema integration is divided into the following subtasks.
1. Identifying correspondences and conflicts among the schemas:
As the schemas are designed individually it is necessary to specify constructs in the
schemas that represent the same real-world concept. We must identify these
correspondences before proceeding with the integration. During this process, several types
of conflicts may occur such as:
Naming conflict
Naming conflicts are of two types: synonyms and homonyms. A synonym occurs when two schemas use different names to describe the same concept; for example, an entity type CUSTOMER in one schema may describe the same concept as an entity type CLIENT in another schema. A homonym occurs when two schemas use the same name to describe different concepts. For example, an entity type Classes may represent TRAIN classes in one schema and AEROPLANE classes in another schema.
Type conflicts
A similar concept may be represented in two schemas by different modeling constructs.
For example, DEPARTMENT may be an entity type in one schema and an attribute
in another.
Domain conflicts
A single attribute may have different domains in different schemas. For example, we
may declare Ssn as an integer in one schema and a character string in another. A
conflict of the unit of measure could occur if one schema represented weight in
pounds and the other used kgs.
Conflicts among constraints
Two schemas may impose different constraints, for example, the KEY of an entity type
may be different in each schema.
Disadvantages:
Complexity: Integrating multiple database schemas into a single schema can be a
complex and time-consuming process. It requires a detailed understanding of each
database schema and the relationships between them.
Data inconsistencies: Combining multiple database schemas can result in data
inconsistencies if the schemas are not properly integrated. This can lead to errors
and incorrect results when querying the integrated database.
Performance issues: The performance of the integrated database may be
negatively impacted if the integration is not properly optimized. This can result in
slower query response times and reduced system performance.
Security concerns: Integrating multiple databases into a single schema can
increase the risk of security breaches, as it can be more difficult to control access
to data from different sources. Proper security measures must be put in place to
prevent unauthorized access to sensitive data.
3. Detecting and resolving data value conflicts: The data taken from different databases may differ when merging. The attribute values from one database may differ from those in another database. For example, the date format may differ, such as “MM/DD/YYYY” or “DD/MM/YYYY” (a small sketch of resolving such a conflict is given below).
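A minimal sketch of resolving such a date-format conflict with pandas, assuming two hypothetical sources that store the same dates in MM/DD/YYYY and DD/MM/YYYY formats:
import pandas as pd

us_source = pd.Series(["03/04/2024", "12/25/2023"])   # MM/DD/YYYY
eu_source = pd.Series(["04/03/2024", "25/12/2023"])   # DD/MM/YYYY

# Convert both to a single common datetime representation before integration.
us_dates = pd.to_datetime(us_source, format="%m/%d/%Y")
eu_dates = pd.to_datetime(eu_source, format="%d/%m/%Y")

print(us_dates.equals(eu_dates))   # True: both now hold the same dates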
Data Reduction
This process helps in the reduction of the volume of the data, which makes the analysis
easier yet produces the same or almost the same result. This reduction also helps to reduce
storage space. Some of the data reduction techniques are dimensionality reduction,
numerosity reduction, and data compression.
Dimensionality reduction:
This process is necessary for real-world applications as the data size is big. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set is lowered. Attributes of the data are combined and merged without losing their original characteristics. This also helps in the reduction of storage space and computation time. When the data is highly dimensional, a problem called the “Curse of Dimensionality” occurs.
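A minimal sketch of dimensionality reduction with PCA from scikit-learn, using a made-up dataset in which one attribute is nearly a copy of another:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # redundant attribute

pca = PCA(n_components=2)                        # combine 4 attributes into 2 components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum().round(2))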
Numerosity Reduction:
In this method, the representation of the data is made smaller by reducing its volume while preserving the essential characteristics of the data.
Numerosity reduction is a data reduction technique which replaces the original data by a smaller form of data representation. There are two techniques for numerosity reduction: parametric and non-parametric methods.
Parametric Methods –
For parametric methods, data is represented using some model. The model
is used to estimate the data, so that only parameters of data are required to be
stored, instead of actual data. Regression and Log-Linear methods are used for
creating such models.
Regression: Regression can be simple linear regression or multiple linear regression. When there is only a single independent attribute, such a regression model is called simple linear regression, and if there are multiple independent attributes, then such regression models are called multiple linear regression. In linear regression, the data are modeled to fit a straight line. For example, a random variable y can be modeled as a linear function of another random variable x with the equation y = ax + b, where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively. In multiple linear regression, y will be modeled as a linear function of two or more predictor (independent) variables.
Log-Linear Model: Log-linear model can be used to estimate the
probability of each data point in a multidimensional space for a set of discretized
attributes, based on a smaller subset of dimensional combinations. This allows a
higher-dimensional data space to be constructed from lower-dimensional
attributes. Regression and log-linear model can both be used on sparse data,
although their application may be limited.
Non-Parametric Methods –
These methods, which store reduced representations of the data, include histograms, clustering, sampling and data cube aggregation.
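A minimal sketch of two non-parametric reductions on made-up data: a histogram that keeps only bin counts and boundaries, and random sampling that keeps a small representative subset:
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(1, 100, size=10_000)

# Histogram: 10 counts + 11 bin edges replace 10,000 raw values.
counts, edges = np.histogram(values, bins=10)

# Sampling: a 1% sample drawn without replacement.
sample = rng.choice(values, size=100, replace=False)

print("histogram counts:", counts)
print("sample size:", sample.size)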
Data Compression:
(A table of six example symbols, W1–W6, with their codes appeared here.)
An efficient code is one that uses a minimum number of bits for representing any information. The disadvantage of a plain binary code is that it is a fixed-length code; a Huffman code is better, as it is a variable-length code.
Coding techniques are related to the concepts of entropy and information content, which are studied in a subject called information theory. Information theory also deals with the uncertainty present in a message, which is called its information content. The information content of a symbol with probability pi is given as
log2(1/pi) or -log2 pi.
Entropy :
Entropy is defined as a measure of the uncertainty (disorder) present in the information. It is given as follows:
H= - ∑ pi log2 pi
Entropy is a positive quantity and specifies the minimum number of bits
necessary to encode information. Thus, coding redundancy is given as the
difference between the average number of bits used for coding and entropy.
coding redundancy = Average number of bits - Entropy
By removing redundancy, any information can be stored in a compact manner.
This is the basis of data compression.
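A minimal sketch of these quantities for six symbols with made-up probabilities, comparing the entropy with the average length of a fixed 3-bit code:
import numpy as np

p = np.array([0.4, 0.2, 0.15, 0.1, 0.1, 0.05])   # symbol probabilities (sum to 1)
fixed_len = np.full(6, 3.0)                      # fixed 3-bit binary code for 6 symbols

entropy = -(p * np.log2(p)).sum()                # minimum bits per symbol
avg_bits = (p * fixed_len).sum()                 # average bits with the fixed code

print(f"entropy             = {entropy:.3f} bits/symbol")
print(f"average code length = {avg_bits:.3f} bits/symbol")
print(f"coding redundancy   = {avg_bits - entropy:.3f} bits/symbol")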
Data Transformation
The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements. There are some methods
for data transformation.
Smoothing:
With the help of algorithms, we can remove noise from the dataset, which
helps in knowing the important features of the dataset. By smoothing, we can find
even a simple change that helps in prediction (covered in UNIT I).
Aggregation:
In this method, the data is stored and presented in the form of a summary. The data set, which comes from multiple sources, is integrated and summarized for data analysis. This is an important step since the accuracy of the analysis depends on the quantity and quality of the data. When the quality and the quantity of the data are good, the results are more relevant.
Aggregation in data mining is the process of finding, collecting, and presenting data in a summarized format to perform statistical analysis of business schemes or analysis of human patterns. When a large amount of data is collected from various datasets, it is crucial to gather accurate data to produce significant results. Data aggregation can help in making prudent decisions in marketing, finance, product pricing, etc. Groups of records are replaced with statistical summaries. Aggregated data kept in the data warehouse can help in solving analytical problems, which in turn reduces the time needed to answer queries over the data sets. A small aggregation sketch is given after the examples below.
Examples of aggregate data:
Finding the average age of customers buying a particular product, which can help in finding the targeted age group for that product. Instead of dealing with an individual customer, the average age of the customers is calculated.
Finding the number of consumers by country. This can increase sales in countries with more buyers and help the company enhance its marketing in countries with fewer buyers. Here also, instead of an individual buyer, a group of buyers in a country is considered.
By collecting data from online buyers, the company can analyze consumer behavior patterns and the success of the product, which helps the marketing and finance departments find new marketing strategies and plan the budget.
Finding the voter turnout in a state or country. It is done by counting the total votes for a candidate in a particular region instead of counting individual voter records.
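A minimal sketch of aggregation with pandas groupby on a made-up sales table, computing the average buyer age per product and the number of buyers per country:
import pandas as pd

sales = pd.DataFrame({
    "product": ["A", "A", "B", "B", "B"],
    "country": ["IN", "US", "IN", "IN", "US"],
    "age":     [23, 35, 41, 29, 52],
})

avg_age_per_product = sales.groupby("product")["age"].mean()   # summary replaces raw rows
buyers_per_country = sales.groupby("country").size()

print(avg_age_per_product)
print(buyers_per_country)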
Discretization:
The continuous data here is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can set an interval like (3 pm–5 pm or 6 pm–8 pm).
Discretization is one form of data transformation technique. It transforms numeric values into interval labels or conceptual labels. For example, age can be transformed into intervals (0–10, 11–20, …) or into conceptual labels like youth, adult, senior.
Different techniques of discretization:
1. Discretization by binning: It is an unsupervised method of partitioning the data into equal partitions, either by equal width or by equal frequency.
2. Discretization by clustering: Clustering can be applied to discretize numeric attributes. It partitions the values into different clusters or groups by following a top-down or bottom-up strategy.
3. Discretization by decision tree: It employs a top-down splitting strategy. It is a supervised technique that uses class information.
4. Discretization by correlation analysis: ChiMerge employs a bottom-up approach by finding the best neighboring intervals and then merging them recursively to form larger intervals.
5. Discretization by histogram: Histogram analysis is unsupervised learning because, like binning, it does not use any class information. There are various partitioning rules used to define histograms.
A short binning sketch is given below.
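A minimal sketch of discretization by binning with pandas, turning made-up ages into interval labels and into conceptual labels such as youth/adult/senior:
import pandas as pd

age = pd.Series([7, 15, 22, 34, 45, 58, 63, 71])

# Equal-width interval labels.
intervals = pd.cut(age, bins=[0, 10, 20, 30, 40, 50, 60, 70, 80])

# Conceptual labels via a coarser concept hierarchy.
concepts = pd.cut(age, bins=[0, 20, 60, 120], labels=["youth", "adult", "senior"])

print(intervals.tolist())
print(concepts.tolist())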
Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.
Need of Normalization
Normalization is generally required when we are dealing with attributes on different scales; otherwise, it may dilute the effectiveness of an equally important attribute (on a lower scale) because another attribute has values on a larger scale. In simple words, when multiple attributes are present but their values are on different scales, this may lead to poor data models while performing data mining operations. So they are normalized to bring all the attributes onto the same scale.
Decimal Scaling Normalization –
In this technique, the data is normalized by moving the decimal point of the attribute values. Each value Vi is replaced according to the formula below, where j is the smallest integer such that the maximum absolute normalized value is less than 1:
Vi' = Vi / 10^j
Example –
Let the input data be: -10, 201, 301, -401, 501, 601, 701. To normalize the above data:
Step 1: Maximum absolute value in given data(m): 701
Step 2: Divide the given data by 1000 (i.e j=3)
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601,
0.701
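A minimal sketch that reproduces the decimal-scaling example above with NumPy (the way j is computed here is one simple choice, not the only one):
import numpy as np

v = np.array([-10, 201, 301, -401, 501, 601, 701], dtype=float)

j = int(np.ceil(np.log10(np.abs(v).max())))   # max |v| = 701 -> j = 3
normalized = v / 10 ** j

print("j =", j)
print(normalized)   # [-0.01  0.201  0.301 -0.401  0.501  0.601  0.701]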
Min-Max Normalization –
In this technique of data normalization, a linear transformation is performed on the original data. The minimum and maximum values of the attribute are fetched and each value is replaced according to the following formula:
v' = ((v - Min(A)) / (Max(A) - Min(A))) * (new_max(A) - new_min(A)) + new_min(A)
where A is the attribute, Min(A) and Max(A) are the minimum and maximum values of A respectively, v is the old value of each entry in the data, v' is the new value, and new_min(A), new_max(A) are the minimum and maximum values of the required range (i.e., the boundary values of the range) respectively.
Z-score normalization –
In this technique, values are normalized based on the mean and standard deviation of the attribute A. The formula used is:
v' = (v - Ā) / σA
where v and v' are the old and new values of each entry in the data respectively, and Ā and σA are the mean and standard deviation of A respectively.
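A minimal sketch of min-max normalization to [0, 1] and z-score normalization on made-up values:
import numpy as np

v = np.array([200., 300., 400., 600., 1000.])

# Min-max normalization into the new range [0, 1].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: zero mean, unit (population) standard deviation.
zscore = (v - v.mean()) / v.std()

print("min-max:", minmax)
print("z-score:", zscore.round(2))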
ADVANTAGES AND DISADVANTAGES:
Data normalization in data mining can have a number of advantages and disadvantages.
Advantages:
1. Improved performance of machine learning algorithms: Normalization can help to
improve the performance of machine learning algorithms by scaling the input features
to a common scale. This can help to reduce the impact of outliers and improve the
accuracy of the model.
2. Better handling of outliers: Normalization can help to reduce the impact of outliers by
scaling the data to a common scale, which can make the outliers less influential.
3. Improved interpretability of results: Normalization can make it easier to interpret the
results of a machine learning model, as the inputs will be on a common scale.
4. Better generalization: Normalization can help to improve the generalization of a
model, by reducing the impact of outliers and by making the model less sensitive to
the scale of the inputs.
Disadvantages:
1. Loss of information: Normalization can result in a loss of information if the original
scale of the input features is important.
2. Impact on outliers: Normalization can make it harder to detect outliers as they will be
scaled along with the rest of the data.
3. Impact on interpretability: Normalization can make it harder to interpret the results of
a machine learning model, as the inputs will be on a common scale, which may not
align with the original scale of the data.
4. Additional computational costs: Normalization can add additional computational costs
to the data mining process, as it requires additional processing time to scale the data.
In conclusion, data normalization can have both advantages and disadvantages. It can improve the performance of machine learning algorithms and make it easier to interpret the results. However, it can also result in a loss of information and make it harder to detect outliers. It is important to weigh the pros and cons of data normalization and carefully assess the risks and benefits before implementing it.
Attribute-oriented analysis:
Introduction:
Performing data mining analysis on databases is very tough because of the extensive volume of data.
Attribute-oriented analysis is one technique for handling this.
Here the analysis is done on the basis of attributes. Attributes are selected and generalized, and the patterns of knowledge ultimately formed are on the basis of attributes only.
An attribute is a property or characteristic of an object. A collection of attributes describes an object.
Attribute Generalization
Attribute generalization is based on the following rule: “If there is a large set of distinct values for an attribute, then a generalization operator should be selected and applied to the attribute.”
Nominal Attributes: The operation defines a sub-cube by performing a selection on
two or more dimensions.
Structured attributes: Climbing up the concept hierarchy is used, replacing an attribute in an <attribute, value> pair with a more general one. The operation performs aggregation on the data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction.
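A minimal sketch of attribute generalization with pandas, climbing a made-up concept hierarchy from city to country (the mapping table is only illustrative):
import pandas as pd

customers = pd.DataFrame({"city": ["Pune", "Mumbai", "Boston", "Chicago", "Pune"]})

# Concept hierarchy: city -> country (one level up).
city_to_country = {"Pune": "India", "Mumbai": "India",
                   "Boston": "USA", "Chicago": "USA"}

customers["location"] = customers["city"].map(city_to_country)  # generalized attribute
print(customers["location"].value_counts())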
Attribute Relevance
The general idea behind attribute relevance analysis is to compute some measure that quantifies the relevance of an attribute with respect to a given class or concept.
Attribute Selection
Attribute selection is a term commonly used in data mining to describe the tools
and techniques available for reducing inputs to a manageable size for processing
and analysis.
Attribute selection implies not only cardinality reduction but also the choice of
attributes based on their usefulness for analysis.
Selection Criteria
Find a subset of attributes that is most likely to describe/predict the class best. The
following method may be used:
Filtering: Filter-type methods select variables regardless of the model. Filter methods suppress the least interesting variables. These methods are particularly efficient in computation time and robust to overfitting (a small sketch is given below).
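A minimal sketch of a filter method with scikit-learn's SelectKBest on made-up data, scoring each attribute against the class independently of any model:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 4))
X[:, 0] += 2 * y        # only the first attribute is related to the class

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("scores:", selector.scores_.round(1))
print("kept attribute indices:", selector.get_support(indices=True))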
Instance Based Attribute Selection
Instance based filters: The goal of the instance-based search is to find the closest
decision boundary to the instance under consideration and assign weight to the
features that bring about the change.
Class Comparison
In many applications, users may not be interested in having a single class described
or characterised, but rather would prefer to mine a description that compares or
distinguishes one class from other comparable classes. Class comparison mines
descriptions that distinguish a target class from its contrasting classes.
The general procedure for class comparison is as follows:
Data Collection: The set of relevant data in the database is collected by query processing and is partitioned respectively into a target class and one or a set of contrasting classes.
Dimension relevance analysis: If there are many dimensions and an analytical comparison is desired, then dimension relevance analysis should be performed on these classes, and only the highly relevant dimensions are included in the further analysis.
Synchronous Generalization: Generalization is performed on the target class to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation. The contrasting class or classes are generalized to the same level concurrently, resulting in prime contrasting class relations.
Presentation of the derived comparison: The resulting class comparison description
can be visualized in the form of tables, graphs and rules. This presentation usually
includes a “contrasting” measure (such as count %) that reflects the comparisons
between the target and contrasting classes.
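A minimal sketch of presenting a class comparison with pandas, using a made-up student table and count% of the target class (graduate) versus the contrasting class (undergraduate) within each generalized GPA level:
import pandas as pd

students = pd.DataFrame({
    "status": ["grad", "grad", "grad", "undergrad", "undergrad", "undergrad", "undergrad"],
    "gpa":    ["high", "high", "medium", "medium", "low", "low", "medium"],
})

# count% of each class falling in each GPA level.
comparison = pd.crosstab(students["gpa"], students["status"], normalize="columns") * 100
print(comparison.round(1))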
Statistical Measures:
The descriptive statistics are of great help in understanding the distribution of the
data. They help us choose an effective implementation.
Central Tendency
Arithmetic Mean: Mean is the sum of a collection of numbers divided by the
number of numbers in the collection.
Median: Median is the number separating the higher half of a data sample from the lower half.
Mode: Mode is the value that appears most often in a set of data.
Measuring Dispersion
Variance (σ2): Variance measures how far a set of numbers is spread out from its mean.
Standard deviation (σ): Standard deviation is a measure that is used to quantify the amount of variation or dispersion of a set of data values.
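A minimal sketch computing these measures for a small made-up sample with Python's statistics module:
import statistics as st

data = [4, 8, 8, 15, 21, 24]

print("mean    :", st.mean(data))
print("median  :", st.median(data))
print("mode    :", st.mode(data))
print("variance:", st.pvariance(data))   # population variance (sigma^2)
print("std dev :", st.pstdev(data))      # population standard deviation (sigma)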