Data Analysis
Data analysis is the process of inspecting, cleaning, transforming, and
modeling data with the goal of discovering useful information,
suggesting conclusions, and supporting decision-making.
Application/Uses of Data Analytics
Data analytics in finance –
Financial data analytics interprets time-series data to understand risks involved in
monetary operations. Possible future scenarios relating to finance can be generated by
analyzing past data trends.
Data analytics in logistics –
Data analytics is also applied in the fields of logistics and delivery. Data analysis in these
fields enables logistics companies to work out which route is best for a delivery.
Data analytics in transportation -
Data analysis in this domain helps carry out a segment-wise analysis. This includes road safety;
road, air, and rail traffic management; route monitoring for waterway transport; etc.
Data analytics in manufacturing –
Using big data analytics in manufacturing, the efficiency of the supply chain in vehicle
manufacturing is improved. Product customization in the manufacturing industries is also
made easier with the introduction of data analytics.
Data analytics in healthcare –
Data analytics in healthcare helps in discovering which treatment to choose by referring to the
trends in a patient’s medical history. Beyond the medical perspective, healthcare data analytics
also plays a key role in management.
Fraud detection
Many organizations in different industries use data analytics to detect fraudulent activities. These
industries include pharmaceutical, banking, finance, tax, retail, etc.
Security
Security personnel use data analytics (especially predictive analytics) to find future cases of
crimes or security breaches. They can also investigate past or ongoing attacks. Analytics makes it
possible to analyze how IT systems were breached during an attack.
Marketing and digital advertising
Marketers use data analytics to understand the audience and achieve high conversion rates. Both of
these sub-applications involve several activities that rely on data analytics. To
understand the audience, digital ad experts use analytics to learn the intended audience’s likes,
dislikes, age, race, gender, and other features.
Need for data analysis
Informed Decision-Making
Improved Understanding
Competitive Advantage
Risk Mitigation
Efficient Resource Allocation
Continuous Improvement
What is data and the different types of data
Data is a collection of raw information that consists of facts and figures. It can come in the form
of text, observations, figures, images, numbers, graphs, or symbols.
There are 3 types:
Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been organized
into a formatted repository that is typically a database.
It covers all data that can be stored in a SQL database, in tables with rows and columns.
Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but that has some
organizational properties that make it easier to analyze. With some processing, it can be stored in
a relational database.
Unstructured data –
Unstructured data is data that is not organized in a predefined manner or does not have a
predefined data model, so it is not a good fit for a mainstream relational database. For
unstructured data, there are alternative platforms for storing and managing it.
Differences between Structured, Semi-structured and Unstructured data:
Properties | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema
Scalability | It is very difficult to scale the DB schema | Its scaling is simpler than structured data | It is more scalable
Types of Data Analytics
Descriptive data analytics
This type of data analytics examines past data to explain what happened. It is the most
straightforward data analytics technique.
Diagnostic data analytics
Diagnostic data analytics examines past data to explain the cause of an anomaly. This type of
analytics aims to answer “why did this happen?” from a descriptive analytics result.
Predictive data analytics
Predictive data analytics involves using current or historical data to predict future outcomes.
Individuals and companies conduct predictive analysis by combining historical data with
machine learning.
Prescriptive data analytics
Prescriptive data analytics involves selecting the best solution for a problem from the available
options. This type of data analytics examines results from other analytics and gives guidance on
how to reach a specific answer.
Real-time data analytics
Real-time data analytics involves using data as soon as it is entered into the database. Unlike
other types of data analytics that use data from past events (historical data), this type analyses
new data from customers or external sources on the go.
Augmented data analytics
Augmented analytics uses machine learning (ML) and natural language processing (NLP) to
analyze data.
Data Analysis Process consists of the following phases
Data Requirements Specification
The data required for analysis is based on a question or an
experiment.
Data Collection
Data Collection is the process of gathering information on targeted
variables identified as data requirements. Data Collection ensures
that data gathered is accurate such that the related decisions are
valid.
Data Processing
The data that is collected must be processed or organized for
analysis. This includes structuring the data as required for the
relevant Analysis Tools.
Data Cleaning
The processed and organized data may be incomplete, contain
duplicates, or contain errors. Data Cleaning is the process of
preventing and correcting these errors.
Data Analysis
Data that is processed, organized and cleaned would be ready for
the analysis. Various data analysis techniques are available to
understand, interpret, and derive conclusions based on the
requirements.
Communication
The results of the data analysis are to be reported in a format as
required by the users to support their decisions and further action.
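The cleaning, analysis, and communication phases above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the data frame raw and its columns id and sales are hypothetical and not taken from any real dataset.

```r
# Hypothetical raw data: one duplicated row and one missing value.
raw <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  sales = c(100, 250, 250, NA, 400)
)

# Data cleaning: drop exact duplicate rows, then rows with missing values.
deduped <- unique(raw)
cleaned <- na.omit(deduped)

# Data analysis: a simple descriptive summary of the cleaned data.
mean_sales <- mean(cleaned$sales)

# Communication: report the result to the user.
print(cleaned)
cat("Mean sales:", mean_sales, "\n")
```

Real projects usually need more careful cleaning rules (for example, deciding whether a missing value should be dropped or imputed), so this should be read as a starting point only.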
How data analytics works
1. Data collection
2. Adjusting data quality
3. Building an analytical model
4. Presentation
What is R Programming
R is a leading programming language for machine learning, statistics, and data analysis. Objects,
functions, and packages can easily be created in R.
It’s a platform-independent language.
It’s an open-source, free language.
R is not only a statistics package; it can also be integrated with other
languages.
Features of R
R Packages:
Distributed Computing:
Data analysis:
R – Array
R arrays consist of elements that all have the same data type. Arrays are essential data storage
structures defined by a fixed number of dimensions.
In the R programming language, one-dimensional arrays are called vectors. Two-dimensional arrays are
called matrices and consist of fixed numbers of rows and columns.
Creating an Array
An R array can be created with the array() function. A vector of elements is passed to
array() along with the required dimensions.
Syntax:
array(data, dim = c(nrow, ncol, nmat), dimnames = names)
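As a minimal sketch of array(), the example below builds a 2 x 3 x 2 array. The values and the dimension names (r1, c1, m1, and so on) are chosen purely for illustration. Note that the dimensions are passed as a vector built with c().

```r
# Fill a 2 x 3 x 2 array with the values 1 to 12 (filled column by column).
arr <- array(
  1:12,
  dim = c(2, 3, 2),
  dimnames = list(c("r1", "r2"), c("c1", "c2", "c3"), c("m1", "m2"))
)

# Dimensions of the array: rows, columns, matrices.
print(dim(arr))

# Element in row 1, column 2 of the second matrix, by position and by name.
print(arr[1, 2, 2])
print(arr["r1", "c2", "m2"])
```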
R – Matrices
In R programming, matrices are two-dimensional, homogeneous data structures. In a matrix, rows are
the ones that run horizontally and columns are the ones that run vertically.
Creating a Matrix in R
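To create a matrix in R, use the matrix() function.
Its main argument is the vector of elements to place in the matrix; nrow and ncol give its dimensions, byrow controls whether the elements are filled row by row (the default is column by column), and dimnames supplies the row and column names.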
Syntax to Create R-Matrix
matrix(data, nrow, ncol, byrow, dimnames)
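The effect of the byrow argument is easiest to see side by side. The following is a small sketch with arbitrary values; the variable names by_col and by_row are illustrative.

```r
# Default: elements fill the matrix column by column.
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: elements fill the matrix row by row instead.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col)
print(by_row)

# The same position holds different values in the two matrices.
print(by_col[1, 2])
print(by_row[1, 2])
```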
R Vectors
R vectors are one-dimensional arrays that hold multiple data values of the same type.
A key point is that in the R programming language, vector indexing starts from ‘1’ and not from
‘0’.
Types of R vectors
Numeric vectors
Numeric vectors contain numeric values such as integers and doubles (floating-point numbers).
Character vectors
Character vectors in R contain alphanumeric values and special characters.
Logical vectors
Logical vectors in R contain Boolean values such as TRUE and FALSE, and NA for missing values.
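The three vector types and 1-based indexing can be illustrated with a short sketch. The variable names and values here are arbitrary.

```r
num_vec <- c(1.5, 2, 3.25)      # numeric vector
chr_vec <- c("a", "b2", "#c")   # character vector
lgl_vec <- c(TRUE, FALSE, NA)   # logical vector with one missing value

# Indexing starts at 1, so this is the first element.
print(num_vec[1])

# Each vector reports its own type.
print(class(num_vec))
print(class(chr_vec))
print(class(lgl_vec))

# is.na() flags the missing value in the logical vector.
print(is.na(lgl_vec))
```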
R Factors
Factors in R Programming Language are data structures that are implemented to categorize the data or represent
categorical data and store it on multiple levels.
Attributes of Factors in R Language
x: The vector that needs to be converted into a factor.
Levels: The set of distinct values that the input vector x can take.
Labels: A character vector of labels for the levels, given in the same order as the levels.
Exclude: The values to exclude when forming the set of levels.
Ordered: A logical value that decides whether the levels are ordered.
Creating a Factor in R Programming Language
The command used to create or modify a factor in R language is – factor() with a vector as input.
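A minimal sketch of factor() with explicit, ordered levels follows. The size categories are made up for illustration.

```r
sizes <- c("small", "large", "medium", "small")

# Convert the character vector into an ordered factor with explicit levels.
size_factor <- factor(
  sizes,
  levels  = c("small", "medium", "large"),
  ordered = TRUE
)

# The levels, in the order given.
print(levels(size_factor))

# Internally each value is stored as the integer position of its level.
print(as.integer(size_factor))

# Ordered factors support comparisons: is "small" below "large"?
print(size_factor[1] < size_factor[2])
```

Without ordered = TRUE the factor would still categorize the data, but comparisons such as < would not be meaningful.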
Functions in R Programming
A function accepts input arguments and produces the output by executing valid R commands that are inside the
function.
Functions are useful when you want to perform a certain task multiple times.
In the R programming language, the name of a function and the name of the file in which it is
created need not be the same.
Creating a Function in R Programming
Functions are created in R with the function keyword, and the resulting function is usually assigned to a name.
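As a small illustration, the sketch below defines and calls a function. The helper rect_area and its arguments len and width are hypothetical names chosen for this example.

```r
# Hypothetical helper: area of a rectangle, with a default width of 1.
rect_area <- function(len, width = 1) {
  # The value of the last evaluated expression is returned.
  len * width
}

# Call with both arguments, then rely on the default width.
print(rect_area(3, 4))
print(rect_area(5))
```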
Packages in R
Packages in the R programming language are collections of R functions, compiled code, and sample data.
They are stored under a directory called “library” within the R environment. R installs a
group of packages during installation. When the R console starts, only the default packages are
available. Other packages that are already installed need to be loaded explicitly before they can be used
by an R program.
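Loading an installed package explicitly might look like the sketch below. It assumes a standard R installation, where the tools package ships with R but is not attached when the console starts. The file name report.csv is only an example string, and no file is read.

```r
# "tools" is installed with R but has to be loaded explicitly.
library(tools)

# A function from the loaded package: extract a file extension.
print(file_ext("report.csv"))

# The search path now includes the loaded package.
print("package:tools" %in% search())

# Add-on packages must be installed once before they can be loaded.
# This needs network access, so it is left commented out here:
# install.packages("ggplot2")
```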
10 different R commands
print(): Displays an R object on the R console.
read.table(): Reads a file in table format into a data frame; with header = TRUE, the first row is used as column labels.
help(): Obtains documentation for a given R command.
mtext(): Writes text in the margins of a plot (title() sets the main title).
matplot(): Plots the columns of one matrix against the columns of another.
ls(): Lists the objects in the current environment.
rm(): Removes an object from the environment.
boxplot(): Produces a boxplot.
data(): Loads a built-in dataset.
example(): Runs the examples from a command's help page.
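A few of these commands can be tried together in a short session. This sketch uses the built-in mtcars dataset; the variable tmp is a throwaway name. The help() and example() calls open documentation interactively, so they are shown as comments.

```r
# Load a built-in dataset and display its first few mpg values.
data(mtcars)
print(head(mtcars$mpg, 3))

# Create an object, confirm ls() lists it, remove it, and check again.
tmp <- 42
had_tmp <- "tmp" %in% ls()
rm(tmp)
has_tmp <- "tmp" %in% ls()
print(c(before = had_tmp, after = has_tmp))

# Interactive documentation commands (uncomment to try them):
# help(boxplot)
# example(boxplot)
```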
Environments in R Programming
The environment is a virtual space that is triggered when an interpreter of a programming language is launched.
Environment can be assumed as a top-level object that contains the set of names/variables associated
with some values.
Create a New Environment
An environment in R programming can be created using the new.env() function. The variables in it can
then be accessed using the $ or [[ ]] operators. Each variable is stored in a different memory location.
Syntax: new.env(hash = TRUE)
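A minimal sketch of creating an environment with new.env() and accessing its variables follows. The names env, x, y, and z are arbitrary.

```r
# Create a new, empty environment.
env <- new.env(hash = TRUE)

# Store variables with $ and [[ ]].
env$x <- 10
env[["y"]] <- 20

# assign() and get() do the same job using names given as strings.
assign("z", env$x + env[["y"]], envir = env)

# List the names stored in the environment and read one back.
print(ls(env))
print(get("z", envir = env))
```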
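Alternatively, if R is managed through Anaconda, a separate R environment (an isolated installation,
which is different from the environment objects created with new.env()) can be created in Anaconda Navigator:
1. Start Navigator
2. Go to the Environments page
3. Click Create
4. Enter a descriptive name for your environment
5. Next to Packages, select Python version 3.7
6. Check the box next to R and select the version of R you want to use
7. Click Create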
Control Statements in R Programming
Control statements are expressions used to control the execution and flow of the program based on the conditions
provided in the statements. These structures are used to make a decision after assessing the variable.
Types of control statements
if condition
This control structure checks whether the expression provided in parentheses is true. If it is, the
statements in braces {} are executed.
Syntax:
if(expression){
statements
....
....
}
if-else condition
It is similar to if condition but when the test expression in if condition fails, then statements
in else condition are executed.
Syntax:
if(expression){
statements
....
....
} else {
statements
....
....
}
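As an illustration of if and if-else, the sketch below classifies a made-up score. Note that else is written on the same line as the closing brace of the if block; at the top level of a script or the console, R reports a syntax error if else starts a new line.

```r
score <- 72  # hypothetical value

# A plain if: the body runs only when the condition is TRUE.
if (score > 100) {
  print("score is out of range")
}

# if-else: exactly one of the two branches runs.
if (score >= 50) {
  result <- "pass"
} else {
  result <- "fail"
}
print(result)
```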
for loop
A for loop executes a sequence of statements once for each element of a vector and exits when the elements are exhausted.
Syntax:
for(value in vector){
statements
....
....
}
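A short sketch of a for loop that sums the elements of a vector (the values are arbitrary):

```r
total <- 0

# The loop variable takes each element of the vector in turn.
for (value in c(2, 4, 6)) {
  total <- total + value
}

print(total)
```

In practice sum() would do this directly; the loop is written out here only to show the control flow.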
Nested loops
Nested loops are loops placed inside another loop: the inner loop runs in full on every iteration of the
outer loop. Nested loops are commonly used to process a matrix element by element.
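For example, a nested loop can fill a matrix one cell at a time. This is a sketch: the matrix m and the rule "row index times column index" are chosen only for illustration.

```r
# Start with a 2 x 3 matrix of zeros.
m <- matrix(0, nrow = 2, ncol = 3)

# Outer loop walks the rows, inner loop walks the columns.
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    m[i, j] <- i * j
  }
}

print(m)
```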
while loop
A while loop iterates as long as a condition remains true. The test expression is
checked before each execution of the loop body, so the body may not run at all.
Syntax:
while(expression){
statement
....
....
}
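A minimal while loop that counts from 1 to 3 (the counter name is arbitrary):

```r
count <- 1

# The condition is tested before every pass through the body.
while (count <= 3) {
  print(count)
  count <- count + 1
}
```

The loop stops when count reaches 4, because the condition count <= 3 is then false.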
repeat loop and break statement
repeat is a loop that iterates indefinitely because it has no exit condition of its own,
so a break statement is used to exit from it. The break statement can be used in any
type of loop to exit from that loop.
Syntax:
repeat {
statements
....
....
if(expression) {
break
}
}
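A short sketch of repeat with break follows. The limit of 5 is arbitrary.

```r
n <- 0

repeat {
  n <- n + 1
  # Without this break the loop would never end.
  if (n >= 5) {
    break
  }
}

print(n)
```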
return statement
return statement is used to return the result of an executed function and returns control to the calling
function.
Syntax:
return(expression)
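One common use of return() is to leave a function early. The helper safe_divide below is a hypothetical example, not a built-in R function.

```r
# Hypothetical helper: divide a by b, but give NA instead of dividing by zero.
safe_divide <- function(a, b) {
  if (b == 0) {
    return(NA)  # exit early; the line below is never reached
  }
  a / b
}

print(safe_divide(10, 2))
print(safe_divide(1, 0))
```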
next statement
next statement is used to skip the current iteration without executing the further statements and
continues the next iteration cycle without terminating the loop.
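As a sketch, next can be used to skip the even numbers in a range and collect only the odd ones. The variable name odds is arbitrary.

```r
odds <- c()

for (i in 1:6) {
  # Skip the rest of this iteration for even numbers.
  if (i %% 2 == 0) {
    next
  }
  odds <- c(odds, i)
}

print(odds)
```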
Here is an example program that works with a matrix in R:
MATRIX PROGRAM
# Define the row and column names.
row_names <- c("row1", "row2", "row3", "row4")
col_names <- c("col1", "col2", "col3")
# Create a 4 x 3 matrix from the values 5 to 16, filled row by row.
R <- matrix(5:16, nrow = 4, byrow = TRUE,
            dimnames = list(row_names, col_names))
# Access the element in the 3rd row and 2nd column.
print(R[3, 2])
# Add a column to the matrix with the cbind() function.
new_col <- c(17, 18, 19, 20)
R <- cbind(R, new_col)
# Print the updated matrix.
print(R)
Here is the output of the program:
[1] 12
     col1 col2 col3 new_col
row1    5    6    7      17
row2    8    9   10      18
row3   11   12   13      19
row4   14   15   16      20
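This program first defines the row and column names for the matrix. It then creates the matrix using
the matrix() function, specifying the data elements, the number of rows, and the row and
column names; the number of columns is inferred from the data. Finally, it prints one element by its row and column index and appends a new column with cbind().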