
UNIT III DATA ANALYSIS

Statistical Methods: Regression modelling, Multivariate Analysis - Classification: SVM & Kernel Methods - Rule Mining - Cluster Analysis, Types of Data in Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density Based Methods, Grid Based Methods, Model Based Clustering Methods, Clustering High Dimensional Data - Predictive Analytics - Data analysis using R.

REGRESSION MODELING
 These techniques have been developed in response to methods of data collection for which the usual assumptions of regression modeling are not justified. Multivariate analysis of variance concerns problems in which the output is a vector, rather than a single number. One way to analyze this sort of data would be to model each element of the output separately, but this ignores the possible relationships between different elements of the output vector. In other words, such an analysis would be based on the assumption that the different elements of the output vector are not related to one another.
 Multivariate analysis of variance is a form of analysis that does allow for different elements of the output vector to be correlated with one another.
 Repeated measures data arise when we wish to model an output over time. A natural way to perform such a study is to measure the output and the inputs for the same set of individuals at several different times. This is called taking repeated measurements.
Multivariate Analysis of Variance:

 Multivariate analysis of variance (MANOVA) is simply an ANOVA with several dependent variables. ANOVA tests for the difference in means between two or more groups, while MANOVA tests for the difference in two or more vectors of means.
 Given this sort of data, we might be able to analyze it using a multivariate linear model, Y = XB + E, where Y is the m x c matrix of outputs, X is the matrix of inputs, B is the matrix of coefficients and E is the matrix of errors.
 This model can be fitted in exactly the same way as a linear model (by least squares estimation). One way to do this fitting would be to fit a linear model to each of the c dimensions of the output, one at a time.

 Having fitted the model, we can obtain fitted values Ŷ = XB̂.

 The analogue of the residual sum of squares from the (univariate) linear model is the matrix of residual sums of squares and products for the multivariate linear model. This matrix is defined to be R = (Y − Ŷ)ᵀ(Y − Ŷ).

 The R matrix has the residual sum of squares for each of the c dimensions
stored on its leading diagonal. The off-diagonal elements are the residual
sums of cross-products for pairs of dimensions.
 The residual degrees of freedom is exactly the same as it would be for a univariate linear model with the same set of inputs, namely m minus the number of linearly independent inputs. If we call the residual degrees of freedom v, then we can estimate S by Ŝ = R/v.

 If we have model ω0, which is a special case of model ω1, and the residual sums of squares and products matrices for these models are R0 and R1, respectively, then the extra sums of squares and products matrix is R0 − R1.
 From this we can build a multivariate ANOVA, or MANOVA. Instead of the sums of squares of the ANOVA table, the MANOVA table contains sums of squares and products matrices.

 The names of the four commonly used test statistics are:


 Roy's greatest root;
 The Lawley-Hotelling trace;
 The Pillai trace;
 Wilks' lambda.
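
As a concrete illustration, the sketch below fits a multivariate linear model in base R with manova() and requests each of the four test statistics; the use of the built-in iris data and of Species as the single grouping factor is an assumption made purely for illustration.

# Fit a MANOVA: the output vector has four dimensions (the iris measurements),
# the single input is the grouping factor Species (illustrative choice of data).
fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
              data = iris)
summary(fit, test = "Pillai")            # Pillai trace
summary(fit, test = "Wilks")             # Wilks' lambda
summary(fit, test = "Hotelling-Lawley")  # Lawley-Hotelling trace
summary(fit, test = "Roy")               # Roy's greatest root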

Repeated Measures Data:

 Repeated measures data are generated when the output variable is observed at several points in time, on the same individuals.
 The covariates are also observed at the same time points as the output, so the inputs are time-dependent too.
 Repeated measures data are often called longitudinal data.
 The term cross-sectional is often used to mean 'not longitudinal'.
 There are several ways to analyze this sort of data.
 As well as using the MANOVA, another straightforward approach is to
regress the outputs against time for each individual and analyze particular
parameter estimates.
 The three most common ways to analyze longitudinal data, at least in the
statistical research literature, are marginal modeling, subject-specific
modeling and transition modeling.

Marginal modeling:

 The primary complication is that it is unrealistic to assume that there is no correlation between outputs from the same individual.
 We could make no assumptions about the correlation structure, as we
did for the MANOVA, but usually there is reason to believe that the variance
matrix has a particular form. For example, if the individuals are (adult)
patients, then we might expect the correlation between the outputs in weeks 1
and 2 to be the same as that between outputs in weeks 5 and 6 (because the
pairs have the same time lapse between them), with the correlation between
outputs for weeks 2 and 5 being smaller.

Subject-specific modeling:
 Subject-specific modeling is regression modeling in which we allow some of the parameters to vary between individuals, or groups of individuals.
 An example of this sort of model is a random-intercept model such as y_ij = (β0 + b_i) + β1 x_ij + ε_ij, in which each individual i has its own intercept offset b_i.
3. Random Effects Models


 The random effects in the subject specific model are used as a way to cope with
the fact that outputs from one individual are likely to be more similar to one
another than outputs from different individuals are.
 There are several other situations in which random effects could be included in a
model. Some of the main uses of random effects in modeling are given below:

a) Overdispersion: Overdispersion is the phenomenon of the observed variability of the output being greater than the fitted model predicts.
There are two main ways in which overdispersion might arise, assuming
that we have identified the correct output distribution. One way is that we
failed to record a variable that ought to have been included as an input.
The reasons for not recording the missing input might include: time or
expense of recording it; an incorrect belief that it was not relevant to the
study; we do not actually know how to measure it. The second way is that
the expected output for an individual is not entirely determined by the
inputs, and that there is variation in the expected output between different
individuals who have the same values for the inputs.
b) Hierarchical Models: The hierarchy which gives its name to hierarchical
models is, for the turnip experiment, made up of two levels, the upper
level being the four blocks and the lower level being the plots within the
blocks.

CLASSICAL MULTIVARIATE ANALYSIS

 Classical multivariate analysis is concerned with analyzing samples in which the observations are vectors. Principal components analysis is a method of transforming the vector observations into a new set of vector observations.
 The goal of this transformation is to concentrate the information about the differences
between observations into a small number of dimensions.
 This allows most of the structure of the sample to be seen by plotting a small number of
the new variables against one another.
Principal Components Analysis:

 Principal components analysis is a way of transforming a set of c-dimensional vector observations, x1, x2, ..., xm, into another set of c-dimensional vectors, y1, y2, ..., ym.
 The y's have the property that most of their information content is stored in the first few dimensions (features).
 The idea is that this will allow us to reduce the data to a smaller number of dimensions, with low information loss, simply by discarding some of the elements of the y's.
 Activities that become possible after the dimensionality reduction include:
— obtaining (informative) graphical displays of the data in 2-D;
— carrying out computer-intensive methods on reduced data;
— gaining insight into the structure of the data, which was not apparent in c dimensions.

 The following shows a pairwise scatterplot for a data set with c = 4 dimensions and m = 150 individuals.

 The individuals are iris flowers; the four dimensions correspond to the sepal lengths and
widths, and the petal lengths and widths of the flowers. It is known that there are three
different species of iris represented in this set of data, but we are not using this
information in the analysis.
 These data are very famous and are usually known as Fisher's Iris Data.
 The drawbacks with pairwise scatterplots are as follows.
— A high number of dimensions leads to a very complicated picture.
— It is often difficult to locate the same observation in several different panels
(plots).
— There is an implicit assumption that the structure within the data is related
to pairs of features.
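
A quick way to reproduce this kind of display in R is sketched below; it uses the built-in iris data set (Fisher's Iris Data) and colours the points by species only to aid the eye.

# Pairwise scatterplot of the four iris measurements (c = 4, m = 150).
# Colouring by Species is for display only; the species labels are not
# otherwise used in the analysis.
pairs(iris[, 1:4], col = as.integer(iris$Species), pch = 19)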

 An alternative, but equivalent, approach is to find the line that is closest to the x's; here we measure the distance from the x's to a line by finding the squared perpendicular distance from each xj to the line, then summing these squared distances for all the xj's.
The vector y = Eᵀ(x − x̄), where E is the matrix whose columns are the eigenvectors of the sample variance matrix of the x's, is called the vector of principal component scores of x. The ith principal component score of x is yi, the ith element of y; sometimes the principal component scores are referred to simply as the principal components.

The principal component scores have two interesting properties:

1. The elements of y are uncorrelated and the sample variance of the ith principal component score is λi. In other words, the sample variance matrix of y is the diagonal matrix Λ = diag(λ1, ..., λc).

2. The sum of the sample variances for the principal components is equal to the sum of the sample variances for the elements of x.

There are two more issues:


 How many of the principal components are needed to get a good representation
of the data? That is, what is the effective dimensionality of the data?
 Should we normalise the data before carrying out principal components analysis?

Effective Dimensionality:
There are three main ways of determining how many principal components are required to
obtain an adequate representation of the data.

1. Proportion of variance The proportion of the total variance accounted for by the first few principal components; a 2-D picture will be considered a reasonable representation if (λ1 + λ2)/(λ1 + λ2 + ... + λc) is close to 1.

2. The size of important variance The idea here is to consider the variance if all directions were equally important, which is approximately λ̄ = (λ1 + ... + λc)/c, and to regard components whose variance exceeds this average as important.

3. Scree diagram A scree diagram is an index plot of the principal component variances. In other words it is a plot of λi against i.

Normalising:

 If the variables are measured on very different scales, the variables with the largest variances will dominate the first principal components. Normalising (dividing each variable by its sample standard deviation, which is equivalent to working with the correlation matrix rather than the variance matrix) puts all variables on an equal footing before the analysis.

Interpretation :

 The final part of a principal components analysis is to inspect the eigenvectors in the hope of identifying a meaning for the (important) principal components.
 For the normalised Iris Data, the eigenvector matrix (E) and the eigenvalues (the λ's) can be obtained as in the sketch below.
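
A minimal sketch of this analysis in R, assuming the built-in iris data, standardizes the four measurements (the "normalised" analysis) and then reads off the eigenvectors, the eigenvalues and a scree diagram.

# PCA on the normalised (standardized) iris measurements.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
pca$rotation        # eigenvector matrix E: loadings of each variable on each PC
pca$sdev^2          # eigenvalues (the lambda's): variances of the PC scores
summary(pca)        # proportions of variance, for judging effective dimensionality
screeplot(pca, type = "lines")   # scree diagram: plot of lambda_i against i
head(pca$x)         # principal component scores y = E^T (x - x_bar) of the scaled data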


Principal Components and Neural Networks:
 It is stated that principal components analysis can be performed using a neural network and Hebbian learning.
 Here we will demonstrate how principal components analysis is related to Hebbian learning.
 In the neural network context, the network produces an output, o, from a set of inputs, x, using a set of weights w. The relationship between these three components is o = wᵀx.
 Earlier we considered the problem of finding a vector a so as to maximise the variance of y, where y = aᵀx; since this variance can be made arbitrarily large simply by rescaling, we need a constraint on w, such as wᵀw = 1.
Correspondence Analysis:

 Correspondence analysis is a way to represent the structure within incidence matrices.
 Incidence matrices are also called two-way contingency tables.
 An example of a (5 x 4) incidence matrix, with marginal totals, is shown in Table 3.17.
 The aim is to produce a picture that shows which groups of staff have which smoking
habits.
 There is an implicit assumption that smoking category and staff group are related.
 The first step towards this goal is to transform the incidence matrix into something that is
related more directly to association between the two variables.

NOTATION:

Chi-squared distance:
 It turns out that the χ² statistic can be decomposed in a way that will allow us to highlight the patterns in the incidence matrix. This decomposition arises from consideration of row profiles.
 The row profiles of an incidence matrix are defined to be the vectors obtained by dividing the entries in each row by the corresponding row total.

SVM & Kernel Methods

Kernel Methods (KM) are a relatively new family of algorithms that presents a series of useful
features for pattern analysis in datasets. In recent years, their simplicity, versatility and
efficiency have made them a standard tool for practitioners, and a fundamental topic in many
data analysis courses.

KMs combine the simplicity and computational efficiency of linear algorithms, such as the
perceptron algorithm or ridge regression, with the flexibility of non-linear systems, such as
for example neural networks, and statistical approaches such as regularization methods in
multivariate statistics.
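
As a brief illustration of a kernel method in practice, the sketch below trains a support vector machine with an RBF kernel; the choice of the e1071 package, the iris data and the cost value are assumptions made for illustration only.

# SVM with a radial (RBF) kernel on the iris data (e1071 package assumed).
library(e1071)
set.seed(1)
train <- sample(nrow(iris), 100)                        # simple train/test split
fit   <- svm(Species ~ ., data = iris[train, ], kernel = "radial", cost = 1)
pred  <- predict(fit, iris[-train, ])
table(predicted = pred, actual = iris$Species[-train])  # confusion matrix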

Kernel Perceptron
The main idea of Kernel Methods is to first embed the data into a suitable vector space, and then use simple linear methods to detect relevant patterns in the resulting set of points. If the embedding map is non-linear, this enables us to discover non-linear relations using linear algorithms. Hence, we consider a map φ from the input space X to a feature space F, φ : X → F.

Such a mapping by itself will not solve the problem, but it can become very effective if coupled with the following two observations:

1) The information about the relative positions of the data points in the embedding space
encoded in the inner products between them is all that is needed by many pattern analysis
algorithms;

2) The inner products between the projections of data inputs into high dimensional embedding
spaces can be computed directly from the inputs using a so-called kernel function.

These two points can be explained through kernel perceptron.

Primal and Dual Representation

A simple rewriting of the perceptron rule yields an alternative representation for functions and
learning algorithms, known as the dual representation. In the dual representation, all that is
needed are the inner products between data points. There are many linear learning algorithms
that can be represented in this way.
The simplest iterative algorithm for learning linear classifications is the procedure proposed by Frank Rosenblatt in 1959 for the perceptron, described below.

The algorithm updates the weight vector and bias directly, something that we will refer to as the
primal form in contrast to an alternative dual representation which we will introduce below.

This procedure is guaranteed to converge provided there exists a hyperplane that correctly
classifies the training data. In this case we say that the data are linearly separable. If no such
hyperplane exists the data are said to be nonseparable.

Dual Representation. It is important to note that the perceptron algorithm works by adding misclassified positive training examples or subtracting misclassified negative examples to an initial arbitrary weight vector. If we take the initial weight vector to be the zero vector, this implies that the final weight vector will be a linear combination of the training points, w = Σj αj yj xj, where, since the sign of the coefficient of xj is given by the classification yj, the αj are positive integral values equal to the number of times misclassification of xj has caused the weight to be updated. Points that have caused fewer mistakes will have smaller αj, whereas difficult points will have large values. Once a sample S has been fixed, the vector α is a representation of the weight vector in different coordinates, known as the dual representation.

In the dual representation the decision function can be expressed as h(x) = sign(Σj αj yj ⟨xj, x⟩ + b).

Furthermore, the perceptron algorithm can also be implemented entirely in this dual form.

This alternative formulation of the perceptron algorithm and its decision function has many interesting properties. For example, the fact that the points that are harder to learn have larger αi can be used to rank the data according to their information content.

It is important to note that the information from the training data only enters the algorithm through inner products of the type ⟨xi, xj⟩: in other words, we do not need the coordinates of the points, just the inner products between all pairs. In order to run this algorithm in a feature space, it is sufficient to compute the value of the inner products between the data in that space, and these can often be efficiently computed using a special function known as a kernel.
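
The sketch below implements this dual (kernel) perceptron in R; the Gaussian (RBF) kernel, the number of epochs and all function names are illustrative assumptions rather than a fixed specification.

# Dual (kernel) perceptron sketch. Labels y are assumed to be in {-1, +1}.
rbf_kernel <- function(X, Y, sigma = 1) {
  # Gram matrix of pairwise RBF kernel values between rows of X and rows of Y
  d2 <- outer(rowSums(X^2), rowSums(Y^2), "+") - 2 * X %*% t(Y)
  exp(-d2 / (2 * sigma^2))
}

kernel_perceptron <- function(X, y, kernel = rbf_kernel, epochs = 20) {
  m <- nrow(X)
  alpha <- numeric(m)          # dual variables: mistake counts per training point
  b <- 0
  K <- kernel(X, X)            # all information enters through inner products K[i, j]
  for (e in seq_len(epochs)) {
    for (i in seq_len(m)) {
      f <- sum(alpha * y * K[, i]) + b   # dual form of the decision function
      if (y[i] * f <= 0) {               # misclassified: update the dual coefficient
        alpha[i] <- alpha[i] + 1
        b <- b + y[i]
      }
    }
  }
  list(alpha = alpha, b = b, X = X, y = y, kernel = kernel)
}

predict_kp <- function(model, Xnew) {
  K <- model$kernel(model$X, Xnew)
  sign(colSums(model$alpha * model$y * K) + model$b)
}

# Example usage on a hypothetical two-class data set:
# model <- kernel_perceptron(X, y); predict_kp(model, Xnew)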

Kernel PCA and CCA

The first one is classical Principal Components Analysis (PCA), aimed at finding a low-dimensional representation of the data; the second one is Canonical Correlation Analysis (CCA), aimed at finding correlations between a dataset formed by pairs of vectors or, equivalently, between two 'matched' datasets, whose elements are in bijection.

Principal Components Analysis:

It is a classical technique for analysing high dimensional data, and extracting hidden structure by finding a (small) set of coordinates that carry most of the information in the data.
 The optimal representation is given by the k principal eigenvectors of the data: for any fixed k, it minimises the mean square distance between the original and the reconstructed data.
 The eigenvectors are called the principal axes of the data, and the new coordinates of each point are obtained by projecting it onto the first k principal axes.
 We will represent such eigenvectors in dual form, as linear combinations of the data vectors, and need to find the corresponding parameters. Given a set of (unlabeled) observations S = {x1, ..., xm} that are centered, the covariance matrix is defined as C = (1/m) Σi xi xiᵀ,

and is a positive semi-definite matrix. Its eigenvectors and eigenvalues satisfy Cv = λv, and each eigenvector can be written as a linear combination of the training data, v = Σi αi xi, for some coefficients αi, hence allowing a dual representation and the use of kernels.
 The new coordinates of a point are then given by projecting it onto the eigenvectors, and
it is possible to prove that for any k using the first k eigenvectors gives the best
approximation in that it minimises the sum of the 2-norms of the residuals of the training
points.
 By performing the same operation in the feature space, that is using the images φ(xi) of the points, with simple manipulations we can find that the coefficients αⁿ of the n-th eigenvector can be obtained by solving the eigenvalue problem mλn αⁿ = K αⁿ, and subsequently imposing the normalization λn ⟨αⁿ, αⁿ⟩ = 1.

Although we do not have the explicit coordinates of the eigenvectors, we can always use them for calculating the projections of the data points onto the n-th eigenvector vⁿ, as ⟨vⁿ, φ(x)⟩ = Σi αiⁿ K(xi, x).
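
A from-scratch sketch of these steps in R is given below; the Gaussian kernel, its bandwidth and the number of retained components are illustrative assumptions, and the leading eigenvalues are assumed to be positive.

# Kernel PCA sketch with an RBF kernel.
kernel_pca <- function(X, sigma = 1, n_comp = 2) {
  m  <- nrow(X)
  d2 <- as.matrix(dist(X))^2
  K  <- exp(-d2 / (2 * sigma^2))           # kernel (Gram) matrix
  ones <- matrix(1 / m, m, m)
  Kc <- K - ones %*% K - K %*% ones + ones %*% K %*% ones  # centre in feature space
  eig <- eigen(Kc, symmetric = TRUE)
  lambda <- eig$values[1:n_comp] / m        # eigenvalues of the covariance operator
  alpha  <- eig$vectors[, 1:n_comp, drop = FALSE]
  # normalise the dual coefficients so the feature-space eigenvectors have unit norm
  alpha  <- sweep(alpha, 2, sqrt(eig$values[1:n_comp]), "/")
  scores <- Kc %*% alpha                    # projections onto the first n_comp eigenvectors
  list(scores = scores, alpha = alpha, lambda = lambda)
}

# Example usage: kernel_pca(as.matrix(iris[, 1:4]), sigma = 1)$scores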

Canonical Correlation Analysis (CCA)

 It is a technique (also introduced by Hotelling) that can be used when we have two
datasets that might have some underlying correlation.
 Assume there is a bijection between the elements of the two sets, possibly corresponding
to two alternative descriptions of the same object (e.g., two views of the same 3D object;
or two versions of the same document in 2 languages).
 Given a set of pairs S = {(xi1, xi2)}, the task of CCA is to find linear combinations of variables in each of the two sets that have the maximum mutual correlation. Given two real valued variables a and b with zero mean, we define their correlation as corr(a, b) = E[ab] / sqrt(E[a²] E[b²]). Maximising this correlation over transformations of the two sets leads to another generalized eigenvalue problem for the dual variables α, where K1 and K2 are the kernel matrices for the vectors xi1 ∈ X1 and xi2 ∈ X2, which by assumption are in bijection, that is, the ij-th entry in each matrix corresponds to the same pair of points.

 By solving this problem, one can find nonlinear transformations of the data both in the
first and in the second set that maximize the correlation between them.
 One use of this approach can be to analyze two different representations of the same
object, possibly translations in different languages of the same documents, or different
views of the same object in machine vision.

CLUSTER ANALYSIS

Clustering is the process of grouping a set of abstract objects into classes of similar objects.

 A cluster of data objects can be treated as one group.

 While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
 The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis


 Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.

 Clustering can also help marketers discover distinct groups in their customer base. And
they can characterize their customer groups based on the purchasing patterns.

 In the field of biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionalities and gain insight into structures inherent to
populations.

 Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to
house type, value, and geographic location.

 Clustering also helps in classifying documents on the web for information discovery.

 Clustering is also used in outlier detection applications such as detection of credit card
fraud.

 As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.

Requirements of Clustering
 Scalability − We need highly scalable clustering algorithms to deal with large databases.

 Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.

 Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be restricted to distance measures that tend to find only small spherical clusters.

 High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.

 Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable, comprehensible, and
usable.

TYPES OF DATA IN CLUSTER ANALYSIS

• Interval-scaled variables

• Binary variables

• Nominal, ordinal, and ratio variables

• Variables of mixed types

• Complex data types

Interval-scaled variables

• Continuous measurements of a roughly linear scale

• For example, weight, height and age

• The measurement unit can affect the cluster analysis

• To avoid dependence on the measurement unit, we should standardize the data

Binary variables

• A binary variable has only two states: 0 or 1

• A contingency table for binary data


Nominal Variables

• A generalization of the binary variable in that it can have more than two states, for example red, yellow, blue, green

Ordinal Variables

• A nominal-like variable whose states are ordered in a meaningful sequence, for example rankings
Ratio Variables

• A positive measurement on a nonlinear scale, approximately at exponential scale

• For example, Ae^(Bt) or Ae^(−Bt)

Variables of Mixed Types

• A database may contain all six types of variables
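
For interval-scaled variables the standardization mentioned above can be done with scale(); for mixed-type data a common choice is Gower's dissimilarity, sketched below under the assumption that the cluster package is available and using a hypothetical small data frame df with numeric, binary and categorical columns.

# Hypothetical mixed-type data frame.
df <- data.frame(age = c(25, 40, 31), height = c(1.70, 1.80, 1.65),
                 smoker = c(0, 1, 0), group = factor(c("A", "B", "A")))

# Standardizing interval-scaled variables removes dependence on measurement units.
scale(df[, c("age", "height")])

# Gower's dissimilarity copes with the mixture of variable types (cluster package assumed).
library(cluster)
d <- daisy(df, metric = "gower")
as.matrix(d)                         # pairwise dissimilarities between the three objects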

Complex data types

• Not all objects considered in data mining are relational => complex types of data

o examples of such data are spatial data, multimedia data, genetic data, time-series
data, text data and data collected from World-Wide Web

• Often require totally different similarity or dissimilarity measures than those above

o this can, for example, mean the use of string and/or sequence matching, or methods of information retrieval

CLUSTERING METHODS
Clustering methods can be classified into the following categories −

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of the data. Each partition will represent a cluster and k ≤ n. This means that it will classify the data into k groups, which satisfy the following requirements −

 Each group contains at least one object.

 Each object must belong to exactly one group.

 For a given number of partitions (say k), the partitioning method will create an initial
partitioning.

 Then it uses the iterative relocation technique to improve the partitioning by moving objects from one group to another.
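
The classic partitioning algorithm of this kind is k-means; a minimal sketch in R, using the built-in iris measurements and k = 3 purely as an illustration, follows.

# k-means partitioning into k = 3 groups (illustrative choice of data and k).
set.seed(42)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
km$size                          # number of objects in each of the k groups
table(km$cluster, iris$Species)  # compare the clusters with the known species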

Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −

 Agglomerative Approach
 Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another, and it keeps doing so until all of the groups are merged into one or until the termination condition holds.

Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each continuing iteration, a cluster is split up into smaller clusters. This is done until each object is in its own cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone.

Approaches to improve Quality of Hierarchical Clustering


Here are the two approaches that are used to improve the quality of hierarchical clustering −

 Perform careful analysis of object linkages at each hierarchical partitioning.


 Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-clustering on
the micro-clusters.
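
A short sketch of agglomerative clustering in base R follows; the Euclidean distance, the average linkage and the cut into three clusters are illustrative choices.

# Agglomerative (bottom-up) hierarchical clustering.
d  <- dist(scale(iris[, 1:4]))          # Euclidean distances on standardized data
hc <- hclust(d, method = "average")     # average linkage (illustrative choice)
plot(hc, labels = FALSE)                # dendrogram of the hierarchical decomposition
groups <- cutree(hc, k = 3)             # cut the tree into 3 clusters
table(groups)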

Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
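
DBSCAN is the best-known algorithm of this family; the sketch below assumes the dbscan package, and the eps radius and minPts threshold are illustrative values that would normally be tuned.

# Density-based clustering with DBSCAN (dbscan package assumed).
library(dbscan)
x  <- as.matrix(scale(iris[, 1:4]))
db <- dbscan(x, eps = 0.6, minPts = 5)  # neighborhood radius and minimum-points threshold
table(db$cluster)                        # cluster 0 denotes noise points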

Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.

Advantage

 The major advantage of this method is fast processing time.

 It is dependent only on the number of cells in each dimension in the quantized space.

Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given
model. This method locates the clusters by clustering the density function. It reflects spatial
distribution of the data points.

This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outlier or noise into account. It therefore yields robust clustering
methods.

CLUSTERING HIGH DIMENSIONAL DATA

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen
to many thousands of dimensions. Such high-dimensional spaces of data are often encountered in
areas such as medicine, where DNA microarray technology can produce a large number of
measurements at once, and the clustering of text documents, where, if a word-frequency vector is
used, the number of dimensions equals the size of the vocabulary.
Four problems need to be overcome for clustering in high-dimensional data:

 Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential
growth of the number of possible values with each dimension, complete enumeration of all
subspaces becomes intractable with increasing dimensionality. This problem is known as
the curse of dimensionality.
 The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and farthest point in particular becomes meaningless: lim(d→∞) (dist_max − dist_min)/dist_min → 0 (a small simulation of this effect is sketched after this list).

 A cluster is intended to group objects that are related, based on observations of their
attribute's values. However, given a large number of attributes some of the attributes will
usually not be meaningful for a given cluster. For example, in newborn screening a cluster of
samples might identify newborns that share similar blood values, which might lead to
insights about the relevance of certain blood values for a disease. But for different diseases,
different blood values might form a cluster, and other values might be uncorrelated. This is
known as the local feature relevance problem: different clusters might be found in different
subspaces, so a global filtering of attributes is not sufficient.
 Given a large number of attributes, it is likely that some attributes are correlated. Hence,
clusters might exist in arbitrarily oriented affine subspaces.
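
The sketch below illustrates the distance-concentration effect behind the limit above: for random points in increasing dimension d, the relative gap between the farthest and nearest neighbour of a query point shrinks. The sample size and dimensions are arbitrary illustrative choices.

# Relative contrast (dist_max - dist_min)/dist_min for a random query point,
# as the dimensionality d grows (illustrative simulation).
set.seed(1)
relative_contrast <- function(d, n = 500) {
  X <- matrix(runif(n * d), nrow = n)          # n random points in [0,1]^d
  q <- runif(d)                                # a random query point
  dists <- sqrt(colSums((t(X) - q)^2))         # Euclidean distances to the query
  (max(dists) - min(dists)) / min(dists)
}
sapply(c(2, 10, 100, 1000), relative_contrast)  # the contrast shrinks as d increases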

PREDICTIVE ANALYTICS

Predictive analytics is a branch of advanced analytics. It is used to make predictions about unknown future events. Predictive analysis involves data collection, statistics, modeling and deployment. It uses many techniques from data mining, statistics and machine learning, and analyzes current data to make predictions about the future. It also allows business users to create predictive intelligence.
Predictive Analysis Process

Define Project – It includes project outcomes, business objectives, deliverables and scoping of the effort.

Data Collection – For predictive analysis, data is collected from different sources for analysis. This provides a complete view of customer interactions.

Data Analysis – It is the process of cleaning, transforming, inspecting and modeling data. The goal of this process is to discover useful information.

Statistics – This process enables us to validate the assumptions and to test them using statistical models.

Modeling – An accurate predictive model of the future is created using predictive modeling. There are also options to choose the best model.

Deployment – To deploy the analytical results into everyday decision-making.

Model Monitoring – The model is managed and monitored to ensure that it is providing the expected results.
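
As a small end-to-end illustration of the modeling and deployment steps, the sketch below fits a logistic regression and scores new records; the simulated data set and all variable names are purely hypothetical.

# Modeling: fit a predictive model on historical (simulated, hypothetical) data.
set.seed(7)
hist_data <- data.frame(spend = runif(200, 0, 100), visits = rpois(200, 3))
hist_data$churn <- rbinom(200, 1, plogis(-2 + 0.02 * hist_data$spend -
                                           0.4 * hist_data$visits))
model <- glm(churn ~ spend + visits, data = hist_data, family = binomial)

# Deployment: score new customers with the fitted model.
new_customers <- data.frame(spend = c(10, 80), visits = c(5, 1))
predict(model, new_customers, type = "response")   # predicted churn probabilities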
Need of Predictive Analysis

a) Secure a competitive stronghold – It helps to know the competitors' weaknesses and the company's strengths. Hence, it allows us to track the actions of consumers and of competitors' marketing and sales.

b) Do more than evaluate the past – Employee analysis helps to check company details and will summarize past failures or past successes. Most importantly, predictive analysis helps in learning from past experiences.

c) Maintain business integrity by managing fraud – Fraud investigators can look into only a set number of cases each week, so the company's past experience is used to score transactions according to their level of risk.

d) Advance your core business capability – The next step to growth is to improve the company's core offering. At its core, this means using predictive analytics to optimize your approach to the market.

Applications of Predictive Analytics

a) Customer Relationship Management – It helps to achieve objectives such as customer service and marketing. Analytic customer relationship management is applied throughout the customer lifecycle.

b) Collection analysis – Predictive analysis optimizes the allocation of collection resources. It also helps in increasing recoveries while reducing collection costs.

c) Fraud detection – It can find inaccurate credit applications and identify false claims.
DATA ANALYSIS USING R

Data analysis is increasingly gaining popularity, and the question of how to perform data analytics using R is also becoming important, because R is a tool that enables data analysts to perform data analysis and visualization.

An important term used in data analytics with R is exploratory data analysis. It is an approach to data analysis employed for summarizing and visualizing a data set; the concept was designed by John Tukey. The focus of the approach is to analyze the data's basic structures and variables in order to develop a basic understanding of the data set, to develop an understanding of the data's origin, and to investigate which methods of statistical analysis would be appropriate for the data.

R has been adopted by statistical experts as a standard software package for data analysis, although other data analysis tools exist. Currently R is free software that can be downloaded for Windows, Linux, Unix or OS X.

Why analyze data using R?

 Straightforward handling of analyses using simple calculations


 Easy to learn for beginners
 Simple and advanced options of analysis available
 Flexible
 Provides both application area and statistical area specialties
 Ability to easily fix mistakes
It is important to understand the basic interface of R. The R software has four basic features: R Console, R Script, R Environment and Graphical Output. Taken together, these features enable analysts to write code in the console, run commands through scripts, manage variables and data sets in the R environment, and then present the data in the form of graphical output. In four simple steps, users can analyze data using R by performing the following tasks:

R-Console: Using R console, analysts can write codes for running the data, and also view the
output codes later, the codes can be written using R Script.

R-Script: R script is the interface where analysts can write codes, the process is quite simple,
users just have to write the codes and then to run the codes they just need to press Ctrl+
Enter, or use the “Run” button on top of R Script.

R Environment: R environment is the space to add external factors, this involves adding the
actual data set, then adding variables, vectors and functions to run the data. You can add all
your data here and then also view whether your data has been loaded accurately in the
environment.
Graphical Output: Once all the scripts and codes are added and data sets and variables are
added to R, graphical output feature could be used to create graphs after the exploratory data
analysis is performed.

R programming for data science

Data analytics with R is performed using the four features of R mentioned above: R Console, R Script, R Environment and Graphical Output. R programming for data science contains different features and packages that can be installed to analyze different types of data. R data analytics enables the user to analyze different types of data such as:

Vector: Vector data sets group together objects from the same class, e.g. a vector could contain numeric values, integers, etc. However, R also allows mixing of objects from different classes, i.e. different vectors can be grouped together for analysis.

Matrices: A matrix data set is created when a vector data set is divided into rows and columns; the data contains elements of the same class, but in matrix form the data structure is two-dimensional.

Data Frame: A data frame can be considered an advanced form of matrix; it is a matrix of vectors with different elements. The difference between a matrix and a data frame is that a matrix must have elements of the same class, whereas in a data frame vectors of different classes can be grouped together.

List: A list is a generic vector that groups together objects from different classes.
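
The sketch below creates each of these data structures in R; all object names and values are arbitrary examples.

v <- c(1.5, 2, 3.7)                        # vector: objects of the same class
m <- matrix(1:6, nrow = 2, ncol = 3)       # matrix: one class, two-dimensional
df <- data.frame(id = 1:3,                 # data frame: columns (vectors) of
                 score = c(1.5, 2, 3.7),   # different classes grouped together
                 group = factor(c("a", "b", "a")))
lst <- list(numbers = v, label = "mixed", flag = TRUE)  # list: different classes
str(df)   # inspect the structure of the data frame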
Apart from supporting different types of data structures, R programming for data science also allows different types of variables to be used, such as:

Continuous Variables: continuous variables can take any numeric value, including decimal values such as 1, 2.5, 4.6, 7, etc.

Categorical Variables: categorical variables can only take values from a fixed set of categories, such as 1, 2, 3, 4, 5. Factors are used for representing categorical variables in data analytics with R.

Missing Values: Missing values are represented by NA. There are commands and arguments (such as na.rm) to perform calculations while ignoring the missing values, but it is important to mark missing entries so that data analytics with R can take them into account.
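
A small sketch of these variable types and of missing-value handling follows (arbitrary example values):

cont  <- c(1, 2.5, 4.6, 7)                   # continuous variable
categ <- factor(c(1, 2, 3, 2, 1))            # categorical variable stored as a factor
x     <- c(1.2, NA, 3.4, NA, 5.0)            # vector with missing values
is.na(x)                                     # locate the missing entries
mean(x)                                      # NA: missing values propagate by default
mean(x, na.rm = TRUE)                        # ignore the missing values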
In addition to different types of data sets and variables, R programming for data science has different control structures, such as:

If, else: if is used to test a certain condition; this could be used, for example, to ask a question such as: if x fails, what would be the result on y?

For: for is a command used to execute a loop a fixed number of times; it can be used to set the number of iterations that the analyst wants.

While: while is used for testing a condition, and it lets the process continue only if the condition analyzed is true. Once the initiated loop body is executed, the condition is tested again; if the condition needs to be altered so that it eventually becomes false, this must be done inside the loop (or before the while command), or the loop will be executed infinitely.
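
A brief sketch of the three control structures (arbitrary example logic):

x <- 5
if (x > 3) {                      # if/else: test a condition
  y <- "high"
} else {
  y <- "low"
}

total <- 0
for (i in 1:10) {                 # for: loop a fixed number of times
  total <- total + i
}

n <- 1
while (n < 100) {                 # while: continue only while the condition holds
  n <- n * 2                      # the condition must eventually become FALSE
}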

R for Data Analysis

R is a powerful tool that helps not only in data analysis but also in communicating the results, through its features for visual graphs and presentation.
R is an easy-to-use tool with an excellent interface; however, learning it can take time. In order to study it, it is important to first understand in detail what the software is and what it does, which can be done both through independent research and professional analysis.
