
UNIT I

DATA PREPROCESSING
Data preprocessing: Data cleaning, Data transformation, Data reduction, Discretization and
generating concept hierarchies. Attribute-oriented analysis: Attribute generalization, Attribute
relevance, Class comparison, Statistical measures.
Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format.

Importance of Data Preprocessing


Data preprocessing is largely concerned with ensuring data quality. Quality can be assessed along the following dimensions:
 Accuracy: whether the data entered is correct.
 Completeness: whether all required data is available and recorded.
 Consistency: whether copies of the same data stored in different places match.
 Timeliness: whether the data is kept up to date.
 Believability: whether the data can be trusted.
 Interpretability: whether the data is easy to understand.

Major Tasks in Data Preprocessing

There are four major tasks in data preprocessing:


 Data cleaning
 Data integration
 Data reduction
 Data transformation

 Data Cleaning

Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from a dataset and replacing missing values. Here are some techniques for data cleaning:
Handling Missing Values (a short pandas sketch follows this list)
 Standard values like “Not Available” or “NA” can be used to replace the missing values.
 Missing values can also be filled manually, but this is not practical when the dataset is big.
 The attribute’s mean value can be used to replace the missing value when the data is normally distributed; in the case of a non-normal distribution, the attribute’s median value can be used instead.
 Regression or decision-tree algorithms can be used to predict the most probable value and fill in the missing value with it.
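A minimal pandas sketch of these options, assuming hypothetical column names (height_cm, income, city); the mean/median choice follows the distribution rule above.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "height_cm": [170, 165, np.nan, 180, 175],            # roughly normal -> mean
    "income":    [30000, 32000, 31000, np.nan, 250000],   # skewed -> median
    "city":      ["Pune", None, "Delhi", "Delhi", "Pune"],
})

df["height_cm"] = df["height_cm"].fillna(df["height_cm"].mean())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna("NA")                       # standard placeholder value
print(df)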
Handling Noisy Data
Noise generally means random error or unnecessary data points. Handling noisy data is one of the most important steps, as it leads to a better-optimized model. Here are some of the methods to handle noisy data.
 Binning: This method is used to smooth noisy data. First the data is sorted, then the sorted values are partitioned and stored in bins. There are three methods for smoothing the data in each bin. Smoothing by bin mean: the values in the bin are replaced by the mean value of the bin. Smoothing by bin median: the values in the bin are replaced by the median value of the bin. Smoothing by bin boundary: the minimum and maximum values of the bin are taken as the bin boundaries, and each value is replaced by the closest boundary value.
Steps to follow (a short code sketch follows the worked example below):
1. Sort the values of the given data set.
2. Divide the sorted values into N bins, each containing approximately the same number of samples (equal-depth partitioning).
3. Replace the values in each bin by the bin mean, median, or boundaries.
Example:

Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition using equal frequency approach:


- Bin 1 : 4, 8, 9, 15
- Bin 2 : 21, 21, 24, 25
- Bin 3 : 26, 28, 29, 34

Smoothing by bin means:


- Bin 1: 9, 9, 9, 9  [(4+8+9+15)/4=9]
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin median:


- Bin 1: 9, 9, 9, 9  [median of (4, 8, 9, 15) = (8+9)/2 = 8.5 ≈ 9]
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:


- Bin 1: 4, 4, 4, 15  [each value is replaced by the closer of the bin boundaries, min = 4 and max = 15]
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
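A small Python sketch of equal-depth binning on the data above; note the worked example rounds the bin-1 median of 8.5 up to 9.

import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 3)                 # equal-depth partitioning

for i, b in enumerate(bins, start=1):
    # replace each value by the nearer of the two bin boundaries
    boundary = np.where(b - b.min() <= b.max() - b, b.min(), b.max())
    print(f"Bin {i}: mean={b.mean():.2f}, median={np.median(b)}, "
          f"boundary-smoothed={boundary.tolist()}")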

 Regression: This is used to smooth the data and helps when unnecessary data is present. For analysis purposes, regression helps to decide which variables are suitable for the analysis.
Linear regression is an algorithm that provides a linear relationship between
an independent variable and a dependent variable to predict the outcome of future
events. It is a statistical method used in data science and machine learning for
predictive analysis.
 The independent variable is also the predictor or explanatory variable that
remains unchanged due to the change in other variables. However, the dependent
variable changes with fluctuations in the independent variable. The regression
model predicts the value of the dependent variable, which is the response or
outcome variable being analyzed or studied.
 Thus, linear regression is a supervised learning algorithm that simulates a
mathematical relationship between variables and makes predictions for
continuous or numeric variables such as sales, salary, age, product price, etc.
 This analysis method is advantageous when at least two variables are available in
the data, as observed in stock market forecasting, portfolio management,
scientific analysis, etc.
 A sloped straight line represents the linear regression model.

[Figure: Best Fit Line for a Linear Regression Model]

In this figure,
X-axis = independent variable
Y-axis = output / dependent variable
Line of regression = best fit line for the model

Here, a line is plotted through the given data points so that it fits them as closely as possible. Hence, it is called the ‘best fit line.’ The goal of the linear regression algorithm is to find this best fit line, as seen in the figure above.
Key benefits of linear regression
Linear regression is a popular statistical tool used in data science, thanks to the several benefits it
offers, such as:
1. Easy implementation
The linear regression model is computationally simple to implement as it does not
demand a lot of engineering overheads, neither before the model launch nor during its
maintenance.
2. Interpretability
Unlike other deep learning models (neural networks), linear regression is relatively
straightforward. As a result, this algorithm stands ahead of black-box models that fall
short in justifying which input variable causes the output variable to change.
3. Scalability
Linear regression is not computationally heavy and, therefore, fits well in cases where
scaling is essential. For example, the model can scale well regarding increased data
volume (big data).
4. Optimal for online settings
The ease of computation of these algorithms allows them to be used in online settings.
The model can be trained and retrained with each new example to generate
predictions in real-time, unlike the neural networks or support vector machines that
are computationally heavy and require plenty of computing resources and
substantial
waiting time to retrain on a new dataset. All these factors make such compute-
intensive models expensive and unsuitable for real-time applications.

The above features highlight why linear regression is a popular model to solve real-
life machine learning problems.
Linear Regression Equation
Mathematically, linear regression is represented by the equation
Y = m*X + b
where
X = independent variable (predictor)
Y = dependent variable (target)
m = slope of the line (slope is defined as the ‘rise’ over the ‘run’)
b = Y-intercept of the line

Example:
Let’s consider a dataset that relates RAM capacity to cost. The dataset comprises two features: memory capacity and cost. The larger the RAM, the higher its purchase cost.
Dataset: RAM Capacity vs. Cost

X (RAM Capacity)    Y (Cost)
2                   12
4                   16
8                   28
16                  62
Plotting RAM capacity on the X-axis and cost on the Y-axis gives a scatter plot in which the points rise from the lower-left corner towards the upper right; a straight line through these points represents the relationship between X and Y.
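A minimal sketch fitting the best fit line Y = m*X + b to this table with ordinary least squares (numpy.polyfit); the 32 GB prediction is illustrative.

import numpy as np

X = np.array([2, 4, 8, 16], dtype=float)       # RAM capacity
Y = np.array([12, 16, 28, 62], dtype=float)    # cost

m, b = np.polyfit(X, Y, deg=1)                 # slope and Y-intercept
print(f"best fit line: Y = {m:.2f}*X + {b:.2f}")
print(f"predicted cost for 32 GB RAM: {m * 32 + b:.2f}")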
 Clustering: This is used for finding the outliers and also in grouping the data.
Clustering is generally used in unsupervised learning.

The task of grouping data points based on their similarity to each other is called clustering or cluster analysis. This method belongs to the branch of unsupervised learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, no target variable is defined.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity using a metric such as Euclidean distance, Cosine similarity, or Manhattan distance, and then groups the points with the highest similarity scores together.
For example, data points may form three roughly circular clusters purely on the basis of distance.

However, it is not necessary that the clusters formed be circular in shape. The shape of clusters can be arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group
similar data points:
Hard Clustering:
In this type of clustering, each data point either belongs to a cluster completely or not at all. For example, say there are four data points and we have to cluster them into two clusters; each data point will belong to either cluster 1 or cluster 2.

Data Point    Cluster
A             C1
B             C2
C             C2
D             C1

Soft Clustering:
In this type of clustering, instead of assigning each data point to a single cluster, a probability (or likelihood) of the point belonging to each cluster is evaluated. For example, say there are four data points and we have to cluster them into two clusters; we then evaluate, for every data point, the probability of it belonging to each of the two clusters.

Data Point    Probability of C1    Probability of C2
A             0.91                 0.09
B             0.3                  0.7
C             0.17                 0.83
D             1                    0

Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through the
use cases of Clustering algorithms. Clustering algorithms are majorly used for:
 Market Segmentation – Businesses use clustering to group their customers
and use targeted advertisements to attract more audience.
 Market Basket Analysis – Shop owners analyze their sales to figure out which items are most often bought together by customers. For example, according to a study in the USA, diapers and beer were often bought together by fathers.
 Social Network Analysis – Social media sites use your data to understand
your browsing behaviour and provide you with targeted friend
recommendations or content recommendations.
 Medical Imaging – Doctors use Clustering to find out diseased areas in
diagnostic images like X-rays.
 Anomaly Detection – To find outliers in a stream of real-time dataset or
forecasting fraudulent transactions we can use clustering to identify them.
 Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete, so an entire feature set can be reduced to its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID; using the same principle, clustering can make complex datasets simpler to work with.

Types of Clustering Algorithms


At the surface level, clustering helps in the analysis of unstructured data.
Graphing, the shortest distance, and the density of the data points are a few of the
elements that influence cluster formation. Clustering is the process of determining
how related the objects are based on a metric called the similarity measure.
Similarity metrics are easier to locate in smaller sets of features. It gets harder to
create similarity measures as the number of features increases. Depending on the
type of clustering algorithm being utilized in data mining, several techniques are
employed to group the data from the datasets. In this part, the clustering techniques
are described. Various types of clustering algorithms are:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering

1. Centroid-based Clustering (Partitioning methods)


Partitioning methods are the simplest clustering algorithms. They group data points on the basis of their closeness. Generally, the similarity measures chosen for these algorithms are Euclidean distance, Manhattan distance, or Minkowski distance. The dataset is separated into a predetermined number of clusters, and each cluster is referenced by a vector of values (its centroid); each input data point is assigned to the cluster whose reference vector it is closest to.
The primary drawback of these algorithms is the requirement to establish the number of clusters, “k,” either intuitively or scientifically (for example, using the Elbow Method) before the clustering starts allocating data points. Despite this, it is still the most popular type of clustering. K-means and K-medoids clustering are examples of this type of clustering.
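A minimal centroid-based sketch with scikit-learn's KMeans; the synthetic blob data and the choice k = 3 are purely illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first 10 labels:", kmeans.labels_[:10])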

2. Density-based Clustering (Model-based methods)


Density-based clustering, a model-based method, finds groups based on the
density of data points. Contrary to centroid-based clustering, which requires that
the number of clusters be predefined and is sensitive to initialization, density-based
clustering determines the number of clusters automatically and is less susceptible
to beginning positions. They are great at handling clusters of different sizes and
forms, making them ideally suited for datasets with irregularly shaped or
overlapping clusters. These methods manage both dense and sparse data regions by
focusing on local density and can distinguish clusters with a variety of
morphologies.
In contrast, centroid-based clustering, like k-means, has trouble finding arbitrarily shaped clusters. Because the number of clusters must be preset and the results are extremely sensitive to the initial positioning of the centroids, the outcomes can vary. Furthermore, the tendency of centroid-based approaches to produce spherical or convex clusters restricts their capacity to handle complicated or irregularly shaped
clusters. In conclusion, density-based clustering overcomes the drawbacks of
centroid-based techniques by autonomously choosing cluster sizes, being resilient
to initialization, and successfully capturing clusters of various sizes and forms. The
most popular density-based clustering algorithm is DBSCAN.
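A minimal density-based sketch with scikit-learn's DBSCAN on two crescent-shaped (non-convex) clusters; eps and min_samples are illustrative values that would normally be tuned for the dataset.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaving half-moons: non-convex clusters that k-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_clusters)          # label -1 marks noise points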

3. Connectivity-based Clustering (Hierarchical clustering)


A method for assembling related data points into hierarchical clusters is
called hierarchical clustering. Each data point is initially taken into account as a
separate cluster, which is subsequently combined with the clusters that are the most
similar to form one large cluster that contains all of the data points.
Think about how you may arrange a collection of items based on how similar they
are. Each object begins as its own cluster at the base of the tree when using
hierarchical clustering, which creates a dendrogram, a tree-like structure. The
closest pairings of clusters are then combined into larger clusters after the algorithm
examines how similar the objects are to one another. When every object is in one
cluster at the top of the tree, the merging process has finished. Exploring various
granularity levels is one of the fun things about hierarchical clustering. To obtain
a given number of clusters, you can select to cut the dendrogram at a particular
height. The more similar two objects are within a cluster, the closer they are. It’s
comparable to classifying items according to their family trees, where the nearest
relatives are clustered together and the wider branches signify more general
connections. There are 2 approaches for Hierarchical clustering:
 Divisive Clustering: It follows a top-down approach; here we consider all data points to be part of one big cluster, and this cluster is then divided into smaller groups.
 Agglomerative Clustering: It follows a bottom-up approach; here we consider each data point to be an individual cluster, and these clusters are then merged together until one big cluster contains all data points (a small sketch follows this list).
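A minimal agglomerative (bottom-up) sketch with scikit-learn; cutting the dendrogram at a given height is approximated here by requesting three clusters directly, and the blob data is illustrative.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=7)

# Ward linkage merges, at each step, the pair of clusters that least increases variance
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print("first 10 labels:", agg.labels_[:10])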

4. Distribution-based Clustering
Using distribution-based clustering, data points are generated and organized
according to their propensity to fall into the same probability distribution (such as
a Gaussian, binomial, or other) within the data. The data elements are grouped
using a probability-based distribution that is based on statistical distributions.
Included are data objects that have a higher likelihood of being in the cluster. A
data point is less likely to be included in a cluster the further it is from the cluster’s
central point, which exists in every cluster.
A notable drawback of density and boundary-based approaches is the need
to specify the clusters a priori for some algorithms, and primarily the definition of
the cluster form for the bulk of algorithms. There must be at least one tuning or
hyper-parameter selected, and while doing so should be simple, getting it wrong
could have unanticipated repercussions. Distribution-based clustering has a
definite advantage over proximity and centroid-based clustering approaches in
terms of flexibility, accuracy, and cluster structure. The key issue is that, in order
to avoid overfitting, many clustering methods only work with simulated or
manufactured data, or when the bulk of the data points certainly belong to a preset
distribution. The most popular distribution-based clustering algorithm is the Gaussian Mixture Model (GMM).
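A minimal distribution-based (soft) clustering sketch with a Gaussian Mixture Model; each point receives a probability of belonging to each cluster, as in the soft-clustering table earlier. The data is synthetic and illustrative.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X[:4])              # soft assignments; each row sums to 1
print(np.round(probs, 2))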
Applications of Clustering in different fields:
 Marketing: It can be used to characterize & discover customer segments
for marketing purposes.
 Biology: It can be used for classification among different species of
plants and animals.
 Libraries: It is used in clustering different books on the basis of topics
and information.
 Insurance: It is used to acknowledge the customers, their policies and
identifying the frauds.
 City Planning: It is used to make groups of houses and to study their
values based on their geographical locations and other factors present.
 Earthquake studies: By learning the earthquake-affected areas we can
determine the dangerous zones.
 Image Processing: Clustering can be used to group similar images
together, classify images based on content, and identify patterns in image
data.
 Genetics: Clustering is used to group genes that have similar expression
patterns and identify gene networks that work together in biological
processes.
 Finance: Clustering is used to identify market segments based on
customer behavior, identify patterns in stock market data, and analyze
risk in investment portfolios.
 Customer Service: Clustering is used to group customer inquiries and
complaints into categories, identify common issues, and develop targeted
solutions.
 Manufacturing: Clustering is used to group similar products together,
optimize production processes, and identify defects in manufacturing
processes.
 Medical diagnosis: Clustering is used to group patients with similar
symptoms or diseases, which helps in making accurate diagnoses and
identifying effective treatments.
 Fraud detection: Clustering is used to identify suspicious patterns or
anomalies in financial transactions, which can help in detecting fraud or
other financial crimes.
 Traffic analysis: Clustering is used to group similar patterns of traffic
data, such as peak hours, routes, and speeds, which can help in improving
transportation planning and infrastructure.
 Social network analysis: Clustering is used to identify communities or
groups within social networks, which can help in understanding social
behavior, influence, and trends.
 Cybersecurity: Clustering is used to group similar patterns of network
traffic or system behavior, which can help in detecting and preventing
cyberattacks.
 Climate analysis: Clustering is used to group similar patterns of climate
data, such as temperature, precipitation, and wind, which can help in
understanding climate change and its impact on the environment.
 Sports analysis: Clustering is used to group similar patterns of player or
team performance data, which can help in analyzing player or team
strengths and weaknesses and making strategic decisions.
 Crime analysis: Clustering is used to group similar patterns of crime
data, such as location, time, and type, which can help in identifying crime
hotspots, predicting future crime trends, and improving crime prevention
strategies.

 Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management. Several problems must be considered during data integration.

1. Schema integration:
Integrates metadata (a set of data that describes other data) from different sources.
Definition: Schema integration is used to merge two or more database schemas into a single schema that can store data from all the original databases. For large databases with many expected users and applications, the integration approach of designing individual schemas and then merging them can be used, because the individual views can be kept relatively small and simple. Schema integration is divided into the following subtasks.
1. Identifying correspondences and conflicts among the schema:
As the schemas are designed individually it is necessary to specify constructs in the
schemas that represent the same real-world concept. We must identify these
correspondences before proceeding with the integration. During this process, several types
of conflicts may occur such as:
 Naming conflict
Naming conflicts are of two types synonyms and homonyms. A synonym occurs when
two schemas use different names to describe the same concept, for example, an
entity type CUSTOMER in one schema may describe an entity type CLIENT in
another schema. A homonym occurs when two schemas use the same name to
describe different concepts. For example, an entity type Classes may represent
TRAIN classes in one schema and AEROPLANE classes in another schema.
 Type conflicts
A similar concept may be represented in two schemas by different modeling constructs.
For example, DEPARTMENT may be an entity type in one schema and an attribute
in another.
 Domain conflicts
A single attribute may have different domains in different schemas. For example, we
may declare Ssn as an integer in one schema and a character string in another. A
conflict of the unit of measure could occur if one schema represented weight in
pounds and the other used kgs.
 Conflicts among constraints
Two schemas may impose different constraints, for example, the KEY of an entity type
may be different in each schema.

2. Modifying views to conform to one another:


Some schemas are modified so that they conform to other schemas more closely.
Some of the conflicts that may occur during the first steps are resolved in this step.
3. Merging of Views and Restructuring:
The global schemas are created by merging the individual schemas. Corresponding
concepts are represented only once in the global schema and mapping between the
views and the global schemas are specified. This is the hardest step to achieve in
real-world databases which involve hundreds of entities and relations. It involves a
considerable amount of human intervention and negotiation to resolve conflicts and
to settle on the most reasonable and acceptable solution for a global
schema. Restructuring: as a final, optional step, the global schema may be analyzed and restructured to remove any redundancies or unnecessary complexity.
Schema integration is the process of combining multiple database schemas
into a single schema, which can be used to support data integration, data sharing,
and other data management tasks. Schema integration is necessary when working
with multiple databases or when integrating data from multiple sources, such as in
data warehousing, data integration, and business intelligence applications.
The process of schema integration involves several steps:
1. Identify the source schemas: The first step in schema integration is to identify the
schemas of the databases or data sources that need to be integrated.
2. Analyze the source schemas: Once the source schemas have been identified, they
should be analyzed to identify common attributes and data structures that can be
used to integrate the data.
3. Define the target schema: The target schema is the schema that will be used to
represent the integrated data. The target schema should be designed to support the
requirements of the application or task for which the data will be used.
4. Map the source schemas to the target schema: The next step in schema
integration is to map the attributes and data structures from the source schemas to
the target schema. This involves identifying the common attributes and creating
mappings between the source and target schema.
5. Merge the schemas: Once the source schemas have been mapped to the target
schema, the schemas can be merged to create a single schema that represents the
integrated data.
6. Resolve conflicts: Inevitably, conflicts will arise during the schema integration
process, such as data type conflicts, naming conflicts, or conflicts in data models.
These conflicts must be resolved to ensure the integrity of the integrated data.
7. Test the integrated schema: The final step in schema integration is to test the
integrated schema to ensure that it meets the requirements of the application or task
for which the data will be used.
Advantages:
 Provides a unified view of data: Schema integration enables the creation of a
single database that can be accessed by different users, departments or applications.
This makes it easier for users to access and work with data from different sources.
 Reduces data redundancy: By combining multiple databases into a single
integrated schema, data redundancy can be reduced, leading to more efficient use
of storage space and improved data consistency.
 Increases productivity: An integrated schema simplifies data management and
enables users to work more efficiently. This can lead to increased productivity and
reduced costs.
 Enables better data analysis: An integrated schema provides a more comprehensive view of data, making it easier to analyze and identify patterns and relationships between different data sources.

Disadvantages:
 Complexity: Integrating multiple database schemas into a single schema can be a
complex and time-consuming process. It requires a detailed understanding of each
database schema and the relationships between them.
 Data inconsistencies: Combining multiple database schemas can result in data
inconsistencies if the schemas are not properly integrated. This can lead to errors
and incorrect results when querying the integrated database.
 Performance issues: The performance of the integrated database may be
negatively impacted if the integration is not properly optimized. This can result in
slower query response times and reduced system performance.
 Security concerns: Integrating multiple databases into a single schema can
increase the risk of security breaches, as it can be more difficult to control access
to data from different sources. Proper security measures must be put in place to
prevent unauthorized access to sensitive data.

2. Entity identification problem:

The entity identification problem is about identifying and matching real-world entities that are represented across multiple databases. For example, the system or the user should recognize that student_id in one database and student_name in another database belong to the same entity.

Matching up equivalent real-world entities from multiple data sources is what this problem refers to. The entity identification problem occurs during data integration: when data from multiple sources is integrated, some data items match each other and would become redundant if integrated blindly.
For example: A.cust-id = B.cust-number. Here A and B are two different database tables; cust-id is an attribute of table A and cust-number is an attribute of table B. Although there is no declared relationship between these tables, the cust-id and cust-number attributes take the same values, i.e. they refer to the same real-world entity. This is an example of the entity identification problem. Metadata can be used to avoid errors in such schema integration; it helps ensure that functional dependencies and referential constraints in the source system match those in the target system.

3. Detecting and resolving data value conflicts: The values of the same attribute taken from different databases may differ when the databases are merged. For example, the date format may differ, like “MM/DD/YYYY” versus “DD/MM/YYYY”.
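A small sketch, with assumed table and column names, of resolving such a date-format conflict with pandas before merging the sources.

import pandas as pd

us = pd.DataFrame({"cust_id": [1], "joined": ["03/25/2024"]})   # MM/DD/YYYY
eu = pd.DataFrame({"cust_id": [2], "joined": ["25/03/2024"]})   # DD/MM/YYYY

# parse each source with its own format so both end up in one representation
us["joined"] = pd.to_datetime(us["joined"], format="%m/%d/%Y")
eu["joined"] = pd.to_datetime(eu["joined"], format="%d/%m/%Y")

merged = pd.concat([us, eu], ignore_index=True)
print(merged)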

 Data Reduction
This process helps in the reduction of the volume of the data, which makes the analysis
easier yet produces the same or almost the same result. This reduction also helps to reduce
storage space. Some of the data reduction techniques are dimensionality reduction,
numerosity reduction, and data compression.
 Dimensionality reduction:
This process is necessary for real-world applications, as data sizes are large. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set decreases. Attributes are combined and merged without losing the original characteristics of the data. This also reduces storage space and computation time. When data is highly dimensional, a problem called the “curse of dimensionality” arises.

There are two components of dimensionality reduction:


Feature selection:
In this, we try to find a subset of the original set of variables, or features, to
get a smaller subset which can be used to model the problem. It usually involves
three ways:
 Filter
 Wrapper
 Embedded
Feature extraction:
This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with fewer dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear and non-linear, depending upon
the method used. The prime linear method, called Principal Component Analysis,
or PCA, is discussed below.

Principal Component Analysis


This method was introduced by Karl Pearson. It works on the condition that
while the data in a higher dimensional space is mapped to data in a lower
dimension space, the variance of the data in the lower dimensional space should
be maximum.

It involves the following steps:


1. Construct the covariance matrix of the data.
2. Compute the eigenvectors of this matrix.
3. Use the eigenvectors corresponding to the largest eigenvalues to reconstruct a large fraction of the variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might have been
some data loss in the process. But, the most important variances should be retained
by the remaining eigenvectors.
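A minimal sketch of these steps in NumPy, on a random data matrix used purely for illustration (keeping the top two components).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 samples, 5 features

X_centered = X - X.mean(axis=0)               # center each feature
cov = np.cov(X_centered, rowvar=False)        # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigen-decomposition

order = np.argsort(eigvals)[::-1]             # largest eigenvalues first
components = eigvecs[:, order[:2]]            # keep the top 2 eigenvectors

X_reduced = X_centered @ components           # project to 2 dimensions
print(X_reduced.shape)                        # (100, 2)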
Advantages of Dimensionality Reduction
 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.
 Improved Visualization: High dimensional data is difficult to visualize, and
dimensionality reduction techniques can help in visualizing the data in 2D or
3D, which can help in better understanding and analysis.
 Overfitting Prevention: High dimensional data may lead to overfitting in
machine learning models, which can lead to poor generalization performance.
Dimensionality reduction can help in reducing the complexity of the data, and
hence prevent overfitting.
 Feature Extraction: Dimensionality reduction can help in extracting important
features from high dimensional data, which can be useful in feature selection
for machine learning models.
 Data Preprocessing: Dimensionality reduction can be used as a preprocessing
step before applying machine learning algorithms to reduce the
dimensionality of the data and hence improve the performance of the model.
 Improved Performance: Dimensionality reduction can help in improving the
performance of machine learning models by reducing the complexity of the
data, and hence reducing the noise and irrelevant information in the data.
Disadvantages of Dimensionality Reduction
 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes
undesirable.
 PCA fails in cases where mean and covariance are not enough to define
datasets.
 We may not know how many principal components to keep- in practice, some
thumb rules are applied.
 Interpretability: The reduced dimensions may not be easily interpretable, and
it may be difficult to understand the relationship between the original features
and the reduced dimensions.
 Overfitting: In some cases, dimensionality reduction may lead to overfitting,
especially when the number of components is chosen based on the training
data.
 Sensitivity to outliers: Some dimensionality reduction techniques are
sensitive to outliers, which can result in a biased representation of the data.
 Computational complexity: Some dimensionality reduction techniques, such
as manifold learning, can be computationally intensive, especially when
dealing with large datasets.

 Numerosity Reduction:
 In this method, the representation of the data is made smaller by reducing the
volume. There will not be any loss of data in this reduction.
 Numerosity Reduction is a data reduction technique which replaces the
original data by smaller form of data representation. There are two techniques
for numerosity reduction- Parametric and Non-Parametric methods.
Parametric Methods –
For parametric methods, data is represented using some model. The model
is used to estimate the data, so that only parameters of data are required to be
stored, instead of actual data. Regression and Log-Linear methods are used for
creating such models.
Regression: Regression can be a simple linear regression or multiple linear
regression. When there is only single independent attribute, such regression model
is called simple linear regression and if there are multiple independent attributes,
then such regression models are called multiple linear regression. In linear
regression, the data are modeled to fit a straight line. For example, a random
variable y can be modeled as a linear function of another random variable x with
the equation y = ax+b where a and b (regression coefficients) specifies the slope
and y-intercept of the line, respectively. In multiple linear regression, y will be
modeled as a linear function of two or more predictor(independent) variables.
Log-Linear Model: Log-linear model can be used to estimate the
probability of each data point in a multidimensional space for a set of discretized
attributes, based on a smaller subset of dimensional combinations. This allows a
higher-dimensional data space to be constructed from lower-dimensional
attributes. Regression and log-linear model can both be used on sparse data,
although their application may be limited.
Non-Parametric Methods –
These methods store reduced representations of the data; they include histograms, clustering, sampling, and data cube aggregation.

Histograms: Histogram is the data representation in terms of frequency. It


uses binning to approximate data distribution and is a popular form of data
reduction.
Clustering: Clustering divides the data into groups/clusters. This technique
partitions the whole data into different clusters. In data reduction, the cluster
representation of the data are used to replace the actual data. It also helps to
detect outliers in data.
Sampling: Sampling can be used for data reduction because it allows a large
data set to be represented by a much smaller random data sample (or subset).
Data Cube Aggregation: Data cube aggregation involves moving the data
from detailed level to a fewer number of dimensions. The resulting data set is
smaller in volume, without loss of information necessary for the analysis task.
ADVANTAGES AND DISADVANTAGES:
Numerosity reduction can have both advantages and disadvantages when used in
data mining:
Advantages:
 Improved efficiency: Numerosity reduction can help to improve the
efficiency of machine learning algorithms by reducing the number of data
points in a dataset. This can make it faster and more practical to work with
large datasets.
 Improved performance: Numerosity reduction can help to improve the
performance of machine learning algorithms by removing irrelevant or
redundant data points from the dataset. This can help to make the model more
accurate and robust.
 Reduced storage costs: Numerosity reduction can help to reduce the storage
costs associated with large datasets by reducing the number of data points.
 Improved interpretability: Numerosity reduction can help to improve the
interpretability of the results by removing irrelevant or redundant data points
from the dataset.
Disadvantages:
 Loss of information: Numerosity reduction can result in a loss of information
if important data points are removed during the reduction process.
 Impact on accuracy: Numerosity reduction can impact the accuracy of a
model, as reducing the number of data points can also remove important
information that is needed for accurate predictions.
 Impact on interpretability: Numerosity reduction can make it harder to
interpret the results, as removing irrelevant or redundant data points can also
remove context that is needed to understand the results.
 Additional computational costs: Numerosity reduction can add additional
computational costs to the data mining process, as it requires additional
processing time to reduce the number of data points.
 Data compression:
Data compression is the process of encoding data in a compressed (smaller) form. Compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression removes information, but only information that is unnecessary.
Compression is achieved by removing redundancy, that is, the repetition of unnecessary data.
To illustrate this method let’s assume that there are six symbols, and binary
code is used to assign a unique address to each of these symbols, as shown in the
following table
Binary code requires at least three bits to encode six symbols. It can also be
observed that binary codes 110 and 111 are not used at all. This clearly shows that
binary code is not efficient, and hence an efficient code is required to assign a
unique address.

Symbols        W1     W2     W3     W4     W5     W6
Probability    0.3    0.3    0.1    0.1    0.08   0.02
Binary code    000    001    010    011    100    101

An efficient code is one that uses a minimum number of bits for representing any
information. The disadvantage of binary code is that it is fixed code; a Huffman
code is better, as it is a variable code.
Coding techniques are related to the concepts of entropy and information content, which are studied in the subject called information theory. Information theory also deals with the uncertainty present in a message, which is called its information content. The information content of a symbol with probability pi is given as
log2(1/pi) = -log2(pi).
Entropy:
Entropy is a measure of the uncertainty (randomness) present in the information. It is given as:
H = -∑ pi log2(pi)
Entropy is a non-negative quantity and specifies the minimum average number of bits necessary to encode the information. Thus, coding redundancy is given as the difference between the average number of bits used for coding and the entropy:
coding redundancy = average number of bits - entropy
By removing redundancy, any information can be stored in a compact manner.
This is the basis of data compression.
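A small sketch computing entropy and coding redundancy for the symbol table above (the listed probabilities sum to 0.9, exactly as given in the example).

import math

probabilities = [0.3, 0.3, 0.1, 0.1, 0.08, 0.02]   # W1 .. W6
avg_bits = 3                                        # fixed 3-bit binary code

entropy = -sum(p * math.log2(p) for p in probabilities)
print(f"entropy           = {entropy:.3f} bits/symbol")
print(f"coding redundancy = {avg_bits - entropy:.3f} bits/symbol")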

 Data Transformation
The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements. There are some methods
for data transformation.
 Smoothing:
With the help of algorithms, we can remove noise from the dataset, which
helps in knowing the important features of the dataset. By smoothing, we can find
even a simple change that helps in prediction (covered in UNIT I).
 Aggregation:
In this method, the data is stored and presented in the form of a summary. Data collected from multiple sources is integrated and described together in the analysis. This is an important step, since the accuracy of the analysis depends on the quantity and quality of the data; when both are good, the results are more relevant.
Aggregation in data mining is the process of finding, collecting, and presenting data in a summarized format in order to perform statistical analysis of business schemes or of human behaviour patterns. When large amounts of data are collected from various datasets, it is crucial to gather accurate data to produce significant results. Data aggregation can help in taking prudent decisions in marketing, finance, product pricing, etc. Groups of data are replaced by their statistical summaries. Keeping aggregated data in the data warehouse helps solve analytical problems and reduces the time needed to answer queries over the data sets.
Examples of aggregate data:
 Finding the average age of customer buying a particular product which can
help in finding out the targeted age group for that particular product. Instead
of dealing with an individual customer, the average age of the customer is
calculated.
 Finding the number of consumers by country. This can increase sales in the
country with more buyers and help the company to enhance its marketing in
a country with low buyers. Here also, instead of an individual buyer, a group
of buyers in a country are considered.
 By collecting data from online buyers, the company can analyze consumer behaviour patterns and the success of a product, which helps the marketing and finance departments find new marketing strategies and plan the budget.
 Finding the value of voter turnout in a state or country. It is done by counting
the total votes of a candidate in a particular region instead of counting the
individual voter records.
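A minimal pandas sketch of the idea, with hypothetical column names: individual purchase records are replaced by per-country summaries.

import pandas as pd

orders = pd.DataFrame({
    "country": ["US", "US", "IN", "IN", "DE"],
    "customer_age": [25, 34, 41, 29, 37],
    "amount": [120, 80, 60, 90, 150],
})

# replace individual records with per-country statistical summaries
summary = orders.groupby("country").agg(
    buyers=("customer_age", "count"),
    avg_age=("customer_age", "mean"),
    total_sales=("amount", "sum"),
)
print(summary)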

 Discretization:
The continuous data here is split into intervals. Discretization reduces the
data size. For example, rather than specifying the class time, we can set an interval
like (3 pm-5 pm, or 6 pm-8 pm).
Discretization is one form of data transformation. It transforms numeric values into interval labels or conceptual labels. For example, age can be transformed into intervals (0-10, 11-20, ...) or into conceptual labels like youth, adult, senior.
Different techniques of discretization:
1. Discretization by binning: an unsupervised method that partitions the data into equal partitions, either by equal width or by equal frequency (a small sketch follows this list).
2. Discretization by clustering: clustering can be applied to discretize numeric attributes; it partitions the values into clusters or groups following a top-down or bottom-up strategy.
3. Discretization by decision tree: employs a top-down splitting strategy; it is a supervised technique that uses class information.
4. Discretization by correlation analysis: ChiMerge employs a bottom-up approach, finding the best neighbouring intervals and then merging them recursively to form larger intervals.
5. Discretization by histogram: histogram analysis is unsupervised because, like binning, it does not use class information. Various partitioning rules can be used to define the histograms.
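A minimal pandas sketch of discretization by binning: equal-width bins via cut() and equal-frequency bins via qcut(); the ages and labels are illustrative.

import pandas as pd

ages = pd.Series([4, 15, 21, 27, 33, 41, 48, 56, 63, 70])

equal_width = pd.cut(ages, bins=3, labels=["youth", "adult", "senior"])
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq}))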
 Normalization: It is the method of scaling the data so that it can be represented in
a smaller range. Example ranging from -1.0 to 1.0.
Need for Normalization
Normalization is generally required when we are dealing with attributes on different scales; otherwise an equally important attribute measured on a smaller scale may be diluted by other attributes having values on a larger scale. In simple words, when there are multiple attributes but their values are on different scales, this may lead to poor data models when performing data mining operations, so the attributes are normalized to bring them all onto the same scale.

Methods of Data Normalization –


 Decimal Scaling
 Min-Max Normalization
 z-Score Normalization(zero-mean Normalization)
Decimal Scaling Method for Normalization –
It normalizes by moving the decimal point of the values of the data. To normalize the data by this technique, we divide each value by the maximum absolute value of the data. A data value vi is normalized to vi' by using the formula:

vi' = vi / 10^j

where j is the smallest integer such that max(|vi'|) < 1.
Example –
Let the input data be: -10, 201, 301, -401, 501, 601, 701.
To normalize the above data:
Step 1: The maximum absolute value in the given data is 701.
Step 2: Divide the given data by 1000 (i.e. j = 3).
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701.
Min-Max Normalization –
In this technique of data normalization, a linear transformation is performed on the original data. The minimum and maximum values of the attribute are found, and each value is replaced according to the following formula:

v' = ((v - min(A)) / (max(A) - min(A))) * (new_max(A) - new_min(A)) + new_min(A)

where A is the attribute, min(A) and max(A) are the minimum and maximum values of A, v is the old value of an entry, v' is its new value, and new_min(A), new_max(A) are the boundary values of the required new range.
Z-score Normalization –
In this technique, values are normalized based on the mean and standard deviation of the attribute A. The formula used is:

v' = (v - Ā) / σA

where v and v' are the old and new values of an entry, and Ā and σA are the mean and standard deviation of A, respectively.
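A minimal NumPy sketch applying all three methods to the decimal-scaling example data above; the min-max target range [0, 1] is illustrative.

import numpy as np

v = np.array([-10, 201, 301, -401, 501, 601, 701], dtype=float)

# Decimal scaling: divide by 10^j, the smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal_scaled = v / 10 ** j

# Min-max normalization to the new range [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: zero mean, unit standard deviation
z_score = (v - v.mean()) / v.std()

print("decimal scaled:", decimal_scaled)
print("min-max:       ", np.round(min_max, 3))
print("z-score:       ", np.round(z_score, 3))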
ADVANTAGES AND DISADVANTAGES:
Data normalization in data mining can have a number of advantages and disadvantages.
Advantages:
1. Improved performance of machine learning algorithms: Normalization can help to
improve the performance of machine learning algorithms by scaling the input features
to a common scale. This can help to reduce the impact of outliers and improve the
accuracy of the model.
2. Better handling of outliers: Normalization can help to reduce the impact of outliers by
scaling the data to a common scale, which can make the outliers less influential.
3. Improved interpretability of results: Normalization can make it easier to interpret the
results of a machine learning model, as the inputs will be on a common scale.
4. Better generalization: Normalization can help to improve the generalization of a
model, by reducing the impact of outliers and by making the model less sensitive to
the scale of the inputs.
Disadvantages:
1. Loss of information: Normalization can result in a loss of information if the original
scale of the input features is important.
2. Impact on outliers: Normalization can make it harder to detect outliers as they will be
scaled along with the rest of the data.
3. Impact on interpretability: Normalization can make it harder to interpret the results of
a machine learning model, as the inputs will be on a common scale, which may not
align with the original scale of the data.
4. Additional computational costs: Normalization can add additional computational costs
to the data mining process, as it requires additional processing time to scale the data.
In conclusion, data normalization can have both advantages and disadvantages. It can improve the performance of machine learning algorithms and make it easier to interpret the results. However, it can also result in a loss of information and make it harder to detect outliers. It is important to weigh the pros and cons of data normalization and carefully assess the risks and benefits before implementing it.

Attribute-oriented analysis:
Introduction:
 Performing data mining analysis directly on databases is very difficult because of the extensive volume of data, so techniques are needed to reduce and summarize it.
 Attribute-oriented analysis is one such technique.
 Here the analysis is done on the basis of attributes: attributes are selected and generalized, and the patterns of knowledge ultimately formed are based on these attributes only.
 An attribute is a property or characteristic of an object. A collection of attributes describes an object.
Attribute Generalization
 Attribute generalization is based on the following rule: “If there is a large set of distinct values for an attribute, then a generalization operator should be selected and applied to the attribute.”
 Nominal attributes: the operation defines a sub-cube by performing a selection on two or more dimensions.
 Structured attributes: climbing up the concept hierarchy is used, replacing an attribute in an <attribute, value> pair with a more general one. The operation performs aggregation on the data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Attribute Relevance
 The general idea behind attribute relevance analysis is to compute some measure
which is used to quantify the relevance of an attribute with respect to given class or
concept.
Attribute Selection
 Attribute selection is a term commonly used in data mining to describe the tools
and techniques available for reducing inputs to a manageable size for processing
and analysis.
 Attribute selection implies not only cardinality reduction but also the choice of
attributes based on their usefulness for analysis.
Selection Criteria
 Find a subset of attributes that is most likely to describe/predict the class best. The
following method may be used:
 Filtering: Filter-type methods select variables regardless of the model; they suppress the least interesting variables. These methods are particularly efficient in computation time and robust to overfitting.
Instance Based Attribute Selection
 Instance based filters: The goal of the instance-based search is to find the closest
decision boundary to the instance under consideration and assign weight to the
features that bring about the change.
Class Comparison
 In many applications, users may not be interested in having a single class described
or characterised, but rather would prefer to mine a description that compares or
distinguishes one class from other comparable classes. Class comparison mines
descriptions that distinguish a target class from its contrasting classes.
The general procedure for class comparison is as follows:
 Data Collection: The set of relevant data in the database is collected by query processing and is partitioned into a target class and one or a set of contrasting classes.
 Dimension relevance analysis: If there are many dimensions and analytical
comparisons is desired, then dimension relevance analysis should be
performed on these classes and only the highly relevant dimensions are
included in the further analysis.
 Synchronous Generalization: Generalization is performed on the target class to
the
level controlled by a user-or-expert-specified dimension threshold, which
results in a prime target class relation.
 Presentation of the derived comparison: The resulting class comparison description can be visualized in the form of tables, graphs and rules. This presentation usually includes a “contrasting” measure (such as count%) that reflects the comparison between the target and contrasting classes.
Statistical Measures:
 The descriptive statistics are of great help in understanding the distribution of the
data. They help us choose an effective implementation.
Central Tendency
 Arithmetic Mean: the sum of a collection of numbers divided by the count of numbers in the collection.
 Median: the middle value that separates the higher half of a data sample from the lower half.
 Mode: the value that appears most often in a set of data.
Measuring Dispersion
 Variance (σ²): variance measures how far a set of numbers is spread out from its mean.
 Standard deviation (σ): the square root of the variance; it is used to quantify the amount of variation or dispersion of a set of data values.
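A small sketch computing these measures with Python's statistics module, on an illustrative sample.

import statistics

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print("mean:               ", statistics.mean(data))
print("median:             ", statistics.median(data))
print("mode:               ", statistics.mode(data))
print("variance (sigma^2): ", statistics.pvariance(data))
print("std dev (sigma):    ", statistics.pstdev(data))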
