Unit III Data Analysis
REGRESSION MODELING
The techniques have been developed in response to methods of data collection in which
the usual assumptions of regression modeling are not justified. Multivariate analysis of
variance concerns problems in which the output is a vector, rather than a single number.
One way to analyze this sort of data would be to model each element of the output
separately, but this ignores the possible relationship between different elements of the
output vector. In other words, this analysis would be based on the assumption that the
different elements of the output vector are not related to one another.
Multivariate analysis of variance is a form of analysis that does allow for the different
elements of the output vector to be correlated with one another.
Repeated measures data arise when there is a wish to model an output over time. A
natural way to perform such a study is to measure the output and the inputs for the same
set of individuals at several different times. This is called taking repeated measurements.
Multivariate Analysis of Variance:
The R matrix has the residual sum of squares for each of the c dimensions
stored on its leading diagonal. The off-diagonal elements are the residual
sums of cross-products for pairs of dimensions.
The residual degrees of freedom are exactly the same as they would be for a
univariate linear model with the same set of inputs, namely m minus the
number of linearly independent inputs. If we call the residual degrees of
freedom ν, then we can estimate the error variance matrix Σ by R/ν.
If we have a model Ω0 which is a special case of a model Ω1, and the residual
sums of squares and products matrices for these models are R0 and R1,
respectively, then the extra sums of squares and products matrix is R0 − R1. Using
these matrices in place of the extra sums of squares,
we can build a multivariate ANOVA, or MANOVA. Instead of the sums
of squares of the ANOVA table, the MANOVA table contains sums of
squares and products matrices.
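As a concrete illustration, the following minimal sketch (assuming base R and its built-in iris data, used here purely for demonstration and not part of the discussion above) fits a one-way MANOVA; summary() prints the MANOVA test table and summary.aov() the corresponding univariate ANOVA for each element of the output vector:

    fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
                  data = iris)
    summary(fit)       # MANOVA table of test statistics (Pillai's statistic by default)
    summary.aov(fit)   # univariate ANOVA for each element of the output vector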
Marginal modeling:
is regression modeling of the average, or marginal, relationship between the output and the
inputs across all individuals, in which the correlation between repeated measurements on the
same individual is allowed for but not modeled explicitly.
Subject-specific modeling:
is regression modeling in which we allow some of the parameters to vary between
individuals, or groups of individuals.
An example of this sort of model is a random-intercept model, in which each individual has
its own intercept while the remaining parameters are common to all individuals.
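A minimal sketch of fitting such a model in R, assuming the lme4 package is installed and a hypothetical data frame d with an output y, an input x and a subject identifier id (these names are illustrative, not from the text):

    library(lme4)
    # Random-intercept model: the intercept varies between individuals,
    # the slope for x is common to all individuals.
    fit <- lmer(y ~ x + (1 | id), data = d)
    summary(fit)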
Principal components analysis is used for analyzing samples in which the observations are
vectors. It is a method of transforming the vector observations into a new set
of vector observations.
The goal of this transformation is to concentrate the information about the differences
between observations into a small number of dimensions.
This allows most of the structure of the sample to be seen by plotting a small number of
the new variables against one another.
Principal Components Analysis:
The individuals are iris flowers; the four dimensions correspond to the sepal lengths and
widths, and the petal lengths and widths of the flowers. It is known that there are three
different species of iris represented in this set of data, but we are not using this
information in the analysis.
These data are very famous and are usually known as Fisher's Iris Data.
The drawbacks with pairwise scatterplots are as follows.
— A high number of dimensions leads to a very complicated picture.
— It is often difficult to locate the same observation in several different panels
(plots).
— There is an implicit assumption that the structure within the data is related
to pairs of features.
1. The elements of y are uncorrelated and the sample variance of the i-th principal
component score is λi. In other words, the sample variance matrix of y
is the diagonal matrix with λ1, λ2, ... on its leading diagonal.
2. The sum of the sample variances for the principal components is equal to
the sum of the sample variances for the elements of x.
Effective Dimensionality:
There are three main ways of determining how many principal components are required to
obtain an adequate representation of the data.
2. The size of an important variance. The idea here is to consider the variance each
component would have if all directions were equally important; this is approximately the
average of the λi, and only components whose variance exceeds this average are treated as
important.
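For example, a principal components analysis of the four iris measurements can be carried out in base R as sketched below (an illustrative sketch, not part of the original text); the proportions of variance reported by summary() and the scree plot are the usual aids when judging the effective dimensionality:

    pca <- prcomp(iris[, 1:4])      # principal components of the four measurements
    summary(pca)                    # standard deviations and cumulative proportion of variance
    screeplot(pca, type = "lines")  # scree plot of the component variances
    head(pca$x)                     # principal component scores for the first few flowers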
Normalising:
Interpretation:
To keep the principal components well defined (the variance of a linear combination w'x can be
made arbitrarily large simply by rescaling w), we need a constraint on w; the usual constraint is
w'w = 1.
Correspondence Analysis:
NOTATION:
Chi-squared distance:
It turns out that the χ2 statistic can be decomposed in a way that
will allow us to highlight the patterns in the incidence matrix. This decomposition
arises from consideration of row profiles.
The row profiles of an incidence matrix are defined to be the vectors obtained by dividing
each row of the matrix by its row total.
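As a small illustration (with a hypothetical 2 × 3 incidence matrix N; base R only), the row profiles can be computed by dividing each row by its row total:

    N <- matrix(c(10, 5, 2,
                   3, 8, 6), nrow = 2, byrow = TRUE)  # hypothetical incidence matrix
    prop.table(N, margin = 1)                         # row profiles: each row divided by its total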
Kernel Methods (KM) are a relatively new family of algorithms that presents a series of useful
features for pattern analysis in datasets. In recent years, their simplicity, versatility and
efficiency have made them a standard tool for practitioners, and a fundamental topic in many
data analysis courses.
KMs combine the simplicity and computational efficiency of linear algorithms, such as the
perceptron algorithm or ridge regression, with the flexibility of non-linear systems, such as
for example neural networks, and statistical approaches such as regularization methods in
multivariate statistics.
Kernel Perceptron
The main idea of Kernel Methods is to first embed the data into a suitable vector space, and
then use simple linear methods to detect relevant patterns in the resulting set of points. If
the embedding map is non-linear, this enables us to discover non-linear relations using linear
algorithms. Hence, we consider a map φ from the input space X to a feature space F.
Such a mapping by itself will not solve the problem, but it can become very effective if coupled
with the following two observations:
1) The information about the relative positions of the data points in the embedding space
encoded in the inner products between them is all that is needed by many pattern analysis
algorithms;
2) The inner products between the projections of data inputs into high dimensional embedding
spaces can be computed directly from the inputs using a so-called kernel function.
A simple rewriting of the perceptron rule yields an alternative representation for functions and
learning algorithms, known as the dual representation. In the dual representation, all that is
needed are the inner products between data points. There are many linear learning algorithms
that can be represented in this way.
The simplest iterative algorithm for learning linear classifications is the procedure proposed by
Frank Rosenblatt in 1959 for the perceptron. The algorithm is shown in Table
The algorithm updates the weight vector and bias directly, something that we will refer to as the
primal form in contrast to an alternative dual representation which we will introduce below.
This procedure is guaranteed to converge provided there exists a hyperplane that correctly
classifies the training data. In this case we say that the data are linearly separable. If no such
hyperplane exists the data are said to be nonseparable.
Dual Representation. It is important to note that the perceptron algorithm works by adding
misclassified positive training examples or subtracting misclassified negative examples to an
initial arbitrary weight vector. If we take the initial weight vector to be the zero vector, this
implies that the final weight vector will be a linear combination of the training points,

w = Σj αj yj xj,

where, since the sign of the coefficient of xj is given by the classification yj, the αj are positive
integral values equal to the number of times misclassification of xj has caused the weight to be
updated. Points that have caused fewer mistakes will have smaller αj, whereas difficult points
will have large values. Once a sample S has been fixed, the vector α is a representation of the
weight vector in different coordinates, known as the dual representation.
Furthermore, the perceptron algorithm can also be implemented entirely in this dual form.
This alternative formulation of the perceptron algorithm and its decision function has many
interesting properties. For example, the fact that the points that are harder to learn have larger α i
can be used to rank the data according to their information content.
It is important to note that the information from the training data only enters the algorithm
through inner products of the type ⟨xi, xj⟩: in other words, we do not need the coordinates of the
points, just the inner products between all pairs. In order to run this algorithm in a feature space,
it is sufficient to compute the value of the inner products between the data in that space, and
these can often be computed efficiently using a special function known as a kernel.
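The sketch below is an illustrative implementation of this dual (kernel) perceptron in R, on a hypothetical two-class toy data set with a Gaussian kernel (the data and all names here are assumptions, not from the text); the weight vector is never formed explicitly and the data enter only through the kernel matrix:

    set.seed(1)
    n <- 40
    X <- rbind(matrix(rnorm(n, mean = -1), ncol = 2),   # class -1 points
               matrix(rnorm(n, mean =  1), ncol = 2))   # class +1 points
    y <- rep(c(-1, 1), each = n / 2)

    rbf <- function(a, b, sigma = 1) exp(-sum((a - b)^2) / (2 * sigma^2))
    K <- outer(1:nrow(X), 1:nrow(X),
               Vectorize(function(i, j) rbf(X[i, ], X[j, ])))  # kernel matrix

    alpha <- rep(0, nrow(X)); b <- 0
    for (epoch in 1:20) {
      for (i in 1:nrow(X)) {
        f <- sum(alpha * y * K[, i]) + b   # decision value for point i
        if (y[i] * f <= 0) {               # mistake: update the dual coefficients
          alpha[i] <- alpha[i] + 1
          b <- b + y[i]
        }
      }
    }
    pred <- sign(K %*% (alpha * y) + b)    # predictions on the training points
    mean(pred == y)                        # training accuracy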
Two examples of this approach are outlined here. The first one is the classical Principal
Components Analysis (PCA), aimed at finding a low
dimensional representation of the data; the second one is Canonical Correlation Analysis
(CCA), aimed at finding correlations between a dataset formed by pairs of vectors or,
equivalently, between two 'matched' datasets, whose elements are in bijection.
It is a classical technique for analysing high dimensional data, and extracting hidden structure by
finding a (small) set of coordinates that carry most of the information in the data.
It can be shown that projecting the data onto the k principal eigenvectors gives, for any fixed k,
the representation that minimises the mean square distance between the original and the
reconstructed data.
The eigenvectors are called the principal axes of the data, and the new coordinates of
each point are obtained by projecting it onto the first k principal axes.
We will represent such vectors in dual form, as linear combinations of the data vectors,
and we will need to find the corresponding coefficients. Given a set of (unlabeled)
observations S = {x1, ..., xm} that are centred, the covariance matrix is defined as

C = (1/m) Σi xi xi'

and is a positive semi-definite matrix. Its eigenvectors and eigenvalues satisfy Cv = λv,
and each eigenvector can be written as a linear combination of the training data,

v = Σi αi xi

for some coefficients αi, hence allowing a dual representation and the use of kernels.
The new coordinates of a point are then given by projecting it onto the eigenvectors, and
it is possible to prove that for any k using the first k eigenvectors gives the best
approximation in that it minimises the sum of the 2-norms of the residuals of the training
points.
By performing the same operation in the feature space, that is, using the images φ(xi) of the
points, simple manipulations show that the coefficients αn of the n-th eigenvector can be
obtained by solving the eigenvalue problem

m λn αn = K αn,

where K is the kernel matrix with entries Kij = ⟨φ(xi), φ(xj)⟩.
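A minimal, self-contained sketch of this computation in base R (illustrative only; the Gaussian kernel and the use of the iris measurements are assumptions, and normalisation of the coefficients is omitted):

    X <- as.matrix(iris[, 1:4])
    rbf <- function(a, b, sigma = 1) exp(-sum((a - b)^2) / (2 * sigma^2))
    K <- outer(1:nrow(X), 1:nrow(X),
               Vectorize(function(i, j) rbf(X[i, ], X[j, ])))   # kernel matrix
    m <- nrow(K)
    H <- diag(m) - matrix(1 / m, m, m)   # centring matrix
    Kc <- H %*% K %*% H                  # kernel matrix of the centred feature-space images
    eig <- eigen(Kc, symmetric = TRUE)
    alpha1 <- eig$vectors[, 1]           # dual coefficients of the first principal axis
    scores <- Kc %*% alpha1              # first kernel principal component scores (up to scaling)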
It is a technique (also introduced by Hotelling) that can be used when we have two
datasets that might have some underlying correlation.
Assume there is a bijection between the elements of the two sets, possibly corresponding
to two alternative descriptions of the same object (e.g., two views of the same 3D object;
or two versions of the same document in 2 languages).
Given a set of pairs S = {(x1,x2)j}, the task of CCA is to find linear combinations of
variables in each of the two sets that have the maximum mutual correlation. Given two
real-valued variables a and b with zero mean, we define their correlation as
corr(a, b) = E[ab] / √(E[a²] E[b²]).
Expressing the problem in dual form, we obtain another
generalized eigenvalue problem for the dual variables α, written in terms of the kernel matrices
K1 and K2 of the two sets, where K1 and K2 are computed from the vectors x1i ∈ X1 and x2i ∈ X2,
which by assumption are in bijection, that is, the ij-th entry of each matrix corresponds to the
same pair of points.
By solving this problem, one can find nonlinear transformations of the data both in the
first and in the second set that maximize the correlation between them.
One use of this approach can be to analyze two different representations of the same
object, possibly translations in different languages of the same documents, or different
views of the same object in machine vision.
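Base R provides cancor() for the linear version of this analysis; the sketch below (illustrative only, with simulated 'matched' data sets sharing a common underlying signal) finds the linear combinations of the two sets with maximum correlation:

    set.seed(2)
    z  <- rnorm(100)                          # shared underlying signal
    X1 <- cbind(z + rnorm(100), rnorm(100))   # first view of each object
    X2 <- cbind(z + rnorm(100), rnorm(100))   # second, matched view of the same objects
    cc <- cancor(X1, X2)
    cc$cor                                    # canonical correlations
    cc$xcoef                                  # coefficients of the linear combinations for X1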
CLUSTER ANALYSIS
Clustering is the process of grouping a set of abstract objects into classes of similar objects.
While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes
and helps single out useful features that distinguish different groups.
Clustering can also help marketers discover distinct groups in their customer base and
characterize those customer groups based on their purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionalities and gain insight into structures inherent to
populations.
Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to
house type, value, and geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card
fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Requirements of Clustering
Scalability − We need highly scalable clustering algorithms to deal with large databases.
High dimensionality − The clustering algorithm should not only be able to handle low-
dimensional data but also the high dimensional space.
Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
Interpretability − The clustering results should be interpretable, comprehensible, and
usable.
The following types of variables are commonly used to describe the objects being clustered:
• Interval-scaled variables
• Binary variables
• Ordinal variables
• Ratio variables
• Not all objects considered in data mining are relational; there are also complex types of data.
o Examples of such data are spatial data, multimedia data, genetic data, time-series
data, text data and data collected from the World-Wide Web.
o Handling such data can, for example, mean using string and/or sequence matching, or
methods of information retrieval.
CLUSTERING METHODS
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
partitions of the data. Each partition will represent a cluster and k ≤ n. This means that it will
classify the data into k groups, which satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
Then it uses an iterative relocation technique to improve the partitioning by moving
objects from one group to another. A typical partitioning algorithm of this kind is k-means,
sketched below.
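A minimal k-means sketch in base R (illustrative; the iris measurements and k = 3 are assumptions used only for demonstration):

    km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)  # partition the objects into k = 3 clusters
    km$size                                              # number of objects in each cluster
    table(km$cluster, iris$Species)                      # compare the clusters with the known species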
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another. It keeps on doing so until all of the groups are merged into one or until the termination
condition holds.
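In base R this bottom-up procedure is available through hclust(); a minimal illustrative sketch (again using the iris measurements purely as example data):

    d  <- dist(iris[, 1:4])               # pairwise distances between the objects
    hc <- hclust(d, method = "complete")  # agglomerative (bottom-up) clustering
    plot(hc)                              # dendrogram showing the sequence of merges
    cutree(hc, k = 3)                     # cut the tree to obtain 3 groups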
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects
in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This is
done until each object is in a cluster of its own or until the termination condition holds. This
method is rigid, i.e., once a merging or splitting is done, it can never be undone.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given
cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.
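A density-based clustering of this kind can be run, for example, with the dbscan package (this assumes that package is installed; eps and minPts correspond to the neighborhood radius and minimum number of points described above, and the values below are arbitrary):

    library(dbscan)
    db <- dbscan(iris[, 1:4], eps = 0.5, minPts = 5)  # grow clusters where the density is high enough
    table(db$cluster)                                 # cluster 0 collects the points treated as noise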
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of
cells that form a grid structure.
Advantage
The major advantage of this method is its fast processing time, which is dependent only on the
number of cells in each dimension of the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given
model. This method locates the clusters by clustering the density function. It reflects spatial
distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outlier or noise into account. It therefore yields robust clustering
methods.
Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen
to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in
areas such as medicine, where DNA microarray technology can produce a large number of
measurements at once, and the clustering of text documents, where, if a word-frequency vector is
used, the number of dimensions equals the size of the vocabulary.
Four problems need to be overcome for clustering in high-dimensional data:
Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential
growth of the number of possible values with each dimension, complete enumeration of all
subspaces becomes intractable with increasing dimensionality. This problem is known as
the curse of dimensionality.
The concept of distance becomes less precise as the number of dimensions grows, since the
distance between any two points in a given dataset converges. The discrimination between the
nearest and the farthest point in particular becomes meaningless: as the number of dimensions
grows, the relative difference (dist_max − dist_min) / dist_min tends to zero.
A cluster is intended to group objects that are related, based on observations of their
attribute's values. However, given a large number of attributes some of the attributes will
usually not be meaningful for a given cluster. For example, in newborn screening a cluster of
samples might identify newborns that share similar blood values, which might lead to
insights about the relevance of certain blood values for a disease. But for different diseases,
different blood values might form a cluster, and other values might be uncorrelated. This is
known as the local feature relevance problem: different clusters might be found in different
subspaces, so a global filtering of attributes is not sufficient.
Given a large number of attributes, it is likely that some attributes are correlated. Hence,
clusters might exist in arbitrarily oriented affine subspaces.
PREDICTIVE ANALYTICS
Predictive analytics is a branch of advanced analytics. It is used to make predictions about
unknown future events. Predictive analysis involves data collection, statistics, and
deployment. It uses many techniques from data mining, statistics and machine learning, and analyzes
current data to make predictions about the future. It also allows business users to create
predictive intelligence.
Predictive Analysis Process
Define Project – This includes project outcomes, business objectives, deliverables and scoping of
the effort.
Data Collection – For predictive analysis, data are collected from different sources for analysis,
thus providing a complete view of customer interactions.
Data Analysis – This is the process of cleaning, transforming, inspecting and modeling data. The
goal of this process is to discover useful information.
Statistics – Statistical analysis makes it possible to validate the assumptions and to test them
using statistical models.
Modeling – An accurate predictive model of the future is created using predictive
modeling; there are also options to choose among several models and select the best one. A small
illustration of this step is sketched below.
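A minimal sketch of the modeling step in R (illustrative only: the data frames past.data and new.data, and the variables churn and usage, are hypothetical names, not from the text):

    # Fit a simple predictive model on historical data and score new cases.
    model <- glm(churn ~ usage, data = past.data, family = binomial)
    pred  <- predict(model, newdata = new.data, type = "response")  # predicted probabilities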
b) Do more than evaluate the past – Analysis of company data on its own will only summarize past
failures or past successes. Therefore, the most important thing is that
predictive analysis helps in learning from those past experiences.
c) Maintain business integrity by managing fraud – First of all, Fraud investigators can look
into only a set number of cases each week. Secondly, they use company’s past experience to
score transactions according to their level of risk.
d) Advance your core business capability – The next step to growth is to improve the company's
core offering. At its core, predictive analysis focuses on using data to optimize your approach to
the market.
c) Fraud detection – it can find inaccurate credit applications and identify false claims.
DATA ANALYSIS USING R
Data analysis is increasingly gaining popularity, and the question of how to perform data
analytics using R is also becoming important, due to the importance of R as a tool that enables
data analysts to perform data analysis and visualization.
An important term in data analytics using R is exploratory data analysis. It is an approach
to data analysis employed for summarizing and visualizing a data set; the concept was introduced by
John Tukey. The focus of the approach is to analyze the data's basic structures and variables to
develop a basic understanding of the data set, in order to develop an in-depth understanding of
the data's origin and to investigate which methods of statistical analysis would be appropriate for
the analysis.
R is a software package adopted by statistical experts as a standard tool for data analysis,
although other data analysis tools exist. R is free software that can be downloaded at no cost for
Windows, Linux, Unix or OS X.
R-Console: Using the R console, analysts can write code for running the data and view the
output later; code can also be written using an R script.
R-Script: The R script is the interface where analysts write code. The process is quite simple:
users just have to write the code and then, to run it, press Ctrl+Enter or use the "Run" button
at the top of the R script window.
R Environment: The R environment is the space where external factors are added; this involves
adding the actual data set and then adding the variables, vectors and functions needed to run the
data. You can add all your data here and then also check whether your data have been loaded
accurately into the environment.
Graphical Output: Once all the scripts and code are added and the data sets and variables are
loaded into R, the graphical output feature can be used to create graphs after the exploratory data
analysis is performed.
Data analytics with R is performed using the four features of R mentioned above: the R console,
R script, R environment and graphical output. R programming for data science contains
different features and packages that can be installed to analyze different types of data; R data
analytics enables the user to analyze data structures such as:
Vector: Vector data sets group together objects from the same class, e.g. a vector could contain
numeric values or integers. However, R also allows different kinds of
objects to be analyzed together, i.e. different vectors can be grouped together for analysis.
Matrices: A matrix data set is created when a vector data set is arranged into rows and
columns; the data still contain elements of the same class, but in matrix form the data
structure is two-dimensional.
Data Frame: A data frame can be considered an advanced form of matrix: it is a collection of
vectors with different elements. The difference between a matrix and a data frame is that a
matrix must have elements of the same class, whereas in a data frame vectors of
different classes can be grouped together.
List: List is a specific term used to describe a vector data set that groups together data from
different classes.
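Short illustrative examples of how each of these structures is created in R:

    v   <- c(1.5, 2, 7)                        # vector: objects of the same class
    m   <- matrix(1:6, nrow = 2, ncol = 3)     # matrix: same class, two-dimensional
    df  <- data.frame(id = 1:3,                # data frame: columns of different classes
                      species = c("a", "b", "c"))
    lst <- list(numbers = v, table = df)       # list: groups objects from different classes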
Apart from allowing the analysis of different types of data, R programming for data science
allows different types of variables to be added, such as:
Continuous Variables: continuous variables are variables that can take any value, including
decimal values, e.g. 1, 2.5, 4.6, 7, etc.
Categorical Variables: categorical variables can only take values from a fixed set of levels,
such as 1, 2, 3, 4, 5, etc.
Factors are used for representing categorical variables in data analytics with R.
Missing Values: Missing values are indicated by NA. When values are missing, it is important to
use commands that locate or exclude the missing values (see the short example below) in order to
perform data analytics with R correctly.
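For example (illustrative):

    x <- c(1, 2.5, NA, 7)     # NA marks a missing value
    is.na(x)                  # locate the missing entries
    mean(x)                   # returns NA because of the missing value
    mean(x, na.rm = TRUE)     # compute the mean after removing missing values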
In addition to different types of data sets and variables, R programming for data sciences has
different control structures such as:
If, else: if is used to test a certain condition; this can be used to explore a relation, such
as: if x fails, what would be the result for y?
For: for is a command used to execute a loop a certain number of times; it can be used to set
the fixed number of iterations an analyst wants.
While: while is used for testing a condition, and it lets the loop continue only while the
condition being tested is true. After each execution of the loop body, the condition is tested
again, so the condition must be set up before the while command and altered inside the loop,
or the loop will be executed infinitely. Minimal examples of these control structures follow
below.
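These sketches are illustrative only:

    x <- 5
    if (x > 3) {            # if / else: test a condition
      y <- x^2
    } else {
      y <- 0
    }

    for (i in 1:3) {        # for: loop a fixed number of times
      print(i)
    }

    i <- 1
    while (i <= 3) {        # while: loop only while the condition is true
      print(i)
      i <- i + 1            # alter the condition inside the loop, or it runs forever
    }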
R is a powerful tool that helps not only in data analysis but also in communicating the results,
through its features for visual graphs and presentation.
R is an easy-to-use tool with an excellent interface; however, learning it can take time. In order
to study it, it is important to first understand in detail what the software is and what it does,
and that can be done both through independent research and professional analysis.