NASSCOM Future Skills -
Associate Data Analyst
Module - I
What is Data?
• Data refers to raw and unprocessed information that is collected, stored, and used for various
purposes.
• It can be in the form of numbers, text, images, audio, video, and more.
• Data is the foundation of information, knowledge, and insights.
• It plays a crucial role in decision-making, analysis, and understanding various phenomena.
• Data can be collected from various sources, including sensors, surveys, transactions, social media,
and more.
• In this digital age, the amount of data being generated and collected has grown exponentially.
• This has led to the field of "big data," which deals with the challenges of storing, processing, and
analyzing massive volumes of data to derive valuable insights and knowledge.
Data Growth
Figure 1: Data growth
Data Growth
• Byte (B)
• Kilobyte (KB)
• Megabyte (MB)
• Gigabyte (GB)
• Terabyte (TB)
• Petabyte (PB)
• Exabyte (EB)
• Zettabyte (ZB)
• Yottabyte (YB)
Figure 2: Data growth
Major types of data
Structured Data:
• This type of data is organized and follows a specific format or structure.
• It is typically found in databases, spreadsheets, and tables.
• Structured data is easy to search, sort, and analyze using various tools and techniques.
Unstructured Data:
• Unstructured data doesn't have a predefined format and can include text, images,
audio, and video.
• This type of data is more challenging to analyze and process compared to structured
data, but it often holds valuable insights and information.
Major types of data
Figure 3: Structured vs Unstructured Data
What Is Data Analysis?
• Data analysis is the process of cleaning, transforming, and processing raw data to
extract actionable, relevant information that helps businesses make
informed decisions.
• The procedure helps reduce the risks inherent in decision-making by providing
useful insights and statistics, often presented in charts, images, tables, and
graphs.
What Is the Data Analysis Process?
Data Collection
• Guided by your identified requirements, it’s time to collect the data from your sources.
Data Cleaning
• Not all of the data you collect will be useful, so it’s time to clean it up. This process is where you
remove white spaces, duplicate records, and basic errors.
Data Analysis
• Here is where you use data analysis software and other tools to help you interpret and
understand the data and arrive at conclusions. Data analysis tools include Excel, Python, R,
Looker, Rapid Miner, Chartio, Metabase, Redash, and Microsoft Power BI.
What Is the Data Analysis Process?
Data Interpretation
• Now that you have your results, you need to interpret them and come up with
the best courses of action based on your findings.
Data Visualization
• Data visualization is a fancy way of saying, “graphically show your information
in a way that people can read and understand it.”
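As a tiny illustration of this step, the base-R sketch below plots made-up category counts (the channel names and numbers are purely hypothetical):
# Hypothetical category counts, purely for illustration
sales <- c(Online = 120, Retail = 85, Wholesale = 40)
# A simple bar chart that readers can interpret at a glance
barplot(sales,
        main = "Sales by channel (example data)",
        ylab = "Units sold",
        col = "steelblue")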
What Is Data Analytics?
• This is a subset of data analysis and focuses specifically on the processing and
analysis of datasets to draw conclusions, identify trends, and make predictions.
• Data analytics often involves the use of specialized tools and technologies to
extract insights from large volumes of data.
Techniques:
• Utilizes advanced statistical analysis, predictive modeling, machine learning,
and other sophisticated techniques to derive actionable insights.
Tools
Often involves the use of more advanced tools and technologies, including
programming languages like Python or R, as well as specialized analytics
platforms and frameworks.
• R
• Python
• Apache Spark
• Google Cloud AutoML
• SAS (Statistical Analysis System)
• Microsoft Power BI
• Tableau
• KNIME
• Streamlit
Why R?
• Open source
• Machine learning
• Compatibility
• Data handling
• Powerful graphics
• Complex arithmetic operations
• Highly active community
Download and Install R for Windows Environment
• To install R, go to cran.r-project.org
RStudio for Windows Environment
• To download RStudio, go to https://posit.co/downloads/
R Variables
• R does not have a command for declaring a variable. A variable is created the moment you first assign a
value to it. To assign a value to a variable, use the <- sign. To output (or print) the variable value, just type
the variable name:
• name <- "John"
age <- 40
name # output "John"
age # output 40
Print Statement
R does have a print() function available if you want to use it.
name <- "John"
print(name) # print the value of the name variable
R Concatenate Elements
• You can also concatenate, or join, two or more elements, by using the paste() function.
• To combine both text and a variable, R uses comma (,):
• text <- "awesome"
paste("R is", text)
• text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)
Multiple Variables
• R allows you to assign the same value to multiple variables in one line
• # Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"
# Print variable values
var1
var2
var3
Variable Names
• A variable name must start with a letter and can be a combination of letters, digits, periods (.) and underscores (_). If it starts with a period (.), it cannot be followed by a digit.
• A variable name cannot start with a number or underscore (_)
• Variable names are case-sensitive (age, Age and AGE are three different variables)
• Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)
• # Legal variable names:
myvar <- "John"
my_var <- "John"
myVar <- "John"
MYVAR <- "John"
myvar2 <- "John"
.myvar <- "John"
• # Illegal variable names:
2myvar <- "John"
my-var <- "John"
my var <- "John"
_my_var <- "John"
my_v@ar <- "John"
TRUE <- "John"
Various Data Types
• Numeric
• Character
• Date
• Data frame
• Array
• Matrix
Numbers
There are three number types in R:
• Numeric: x <- 10.5 # numeric
• Integer: y <- 10L # integer
• Complex: z <- 1i # complex
Numeric
A numeric data type is the most common type in R, and contains any number with or without a decimal.
x <- 10.5
y <- 55
Integer
Integers are numeric data without decimals. To create an integer variable, you must use the letter L after the integer value.
x <- 1000L
y <- 55L
Complex
A complex number is written with an "i" as the imaginary part.
x <- 3+5i
y <- 5i
String
• Strings are used for storing text.
• A string is surrounded by either single quotation marks, or double quotation marks
• "hello“ or 'hello’
• str <- "Hello"
str # print the value of str
Multiline Strings
You can assign a multiline string to a variable like this:
str <- "R is the language,
Used to perform the Data Analysis."
str # print the value of str
String Length
There are many useful string functions in R. To find the number of characters in a string, use the nchar() function:
str <- "Hello World!"
nchar(str)
Check a String
Use the grepl() function to check if a character or a sequence of characters are present in a string:
str <- "Hello World!"
grepl("H", str)
grepl("Hello", str)
grepl("X", str)
Date
• R programming language provides several functions that deal with date and time. These functions are used to format
and convert the date from one form to another form.
Table 1: format specifiers
Specifier   Description
%a          Abbreviated weekday
%A          Full weekday
%b          Abbreviated month
%B          Full month
%C          Century
%y          Year without century
%Y          Year with century
%d          Day of month (01-31)
%j          Day in year (001-366)
%m          Month of year (01-12)
%D          Date in %m/%d/%y format
%u          Weekday (1-7), starting on Monday
Date
# today's date
date <- Sys.Date()

# abbreviated day
format(date, format = "%a")   # [1] "Sat"

# full day
format(date, format = "%A")   # [1] "Saturday"

# weekday
format(date, format = "%u")   # [1] "6"
Date
# today's date
date <- Sys.Date()

# default format yyyy-mm-dd
date                                # [1] "2022-04-02"

# day in month
format(date, format = "%d")         # [1] "02"

# month in year
format(date, format = "%m")         # [1] "04"

# abbreviated month
format(date, format = "%b")         # [1] "Apr"

# full month
format(date, format = "%B")         # [1] "April"

# date in %m/%d/%y format
format(date, format = "%D")         # [1] "04/02/22"

# custom format
format(date, format = "%d-%b-%y")   # [1] "02-Apr-22"
Data Frames
• Data Frames are data displayed in the format of a table.
• Data Frames can have different types of data inside them. While the first column can be character, the second and third can
be numeric or logical. However, each column should have the same type of data.
• Use the data.frame() function to create a data frame
Data Frames
• # Create a data frame
Data_Frame <- data.frame (
T1 = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
# Print the data frame
Data_Frame
Data Frames
• Creating a DataFrame
• Accessing rows and columns
• Selecting the subset of the data frame
• Editing dataframes
• Adding extra rows and columns to the data frame
• Add new variables to dataframe based on existing ones
• Delete rows and columns in a data frame
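A minimal sketch of these operations, reusing the Data_Frame object created on the previous slide (the added values and the Long_Session variable are made-up examples):
# Access rows and columns
Data_Frame[1, ]           # first row
Data_Frame[, "Pulse"]     # the Pulse column
Data_Frame$Duration       # the Duration column as a vector

# Select a subset of the data frame
subset(Data_Frame, Duration > 40)

# Edit a single value
Data_Frame[1, "Pulse"] <- 110

# Add an extra row and an extra column
Data_Frame <- rbind(Data_Frame, c("Rest", 90, 20))
Data_Frame <- cbind(Data_Frame, Steps = c(1000, 6000, 2000, 500))

# Add a new variable based on existing ones
Data_Frame$Long_Session <- as.numeric(Data_Frame$Duration) > 40

# Delete the first row and the first column
Data_Frame <- Data_Frame[-1, -1]
Data_Frame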
Array
• Arrays can only have one data type. Arrays can have more than two dimensions.
• We can use the array() function to create an array, and the dim parameter to specify the dimensions:
• # An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray
# An array with more than one dimension
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
How does dim=c(4,3,2) work?
• The first and second numbers in the bracket specify the number of rows and columns.
• The last number specifies how many matrices of that size (here, 4 rows by 3 columns) the array contains.
Access Array Items
• You can access the array elements by referring to the index position. You can use the [] brackets to access the desired
elements from an array
• The syntax is as follows: array[row position, column position, matrix level]
• thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[2, 3, 2]
• You can also access the whole row or column from a matrix in an array, by using the c() function
• thisarray <- c(1:24)
# Access all the items from the first row from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[c(1),,1]
# Access all the items from the first column from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[,c(1),1]
Check if an Item Exists
• To find out if a specified item is present in an array, use the %in% operator
• Example
• Check if the value "2" is present in the array:
• thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
2 %in% multiarray
Array Length
• Use the length() function to find the total number of elements in an array
• thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
length(multiarray)
Matrices
• A matrix is a two dimensional data set with columns and rows.
• A column is a vertical representation of data, while a row is a horizontal representation of data.
• A matrix can be created with the matrix() function. Specify the nrow and ncol parameters to set the
number of rows and columns:
• # Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Print the matrix
thismatrix
• You can also create a matrix with strings
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix
Access Matrix Items
• You can access the items by using [ ] brackets. The first number "1" in the bracket specifies the row-
position, while the second number "2" specifies the column-position:
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[1, 2]
• The whole row can be accessed if you specify a comma after the number in the bracket
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[2,]
• The whole column can be accessed if you specify a comma before the number in the bracket:
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[,2]
Accessing more than one Row:
• More than one row can be accessed if you use the c() function:
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "melon", "fig"), nrow = 3,
ncol = 3)
thismatrix[c(1,2),]
Access More Than One Column
• More than one column can be accessed if you use the c() function:
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "melon", "fig"), nrow = 3,
ncol = 3)
thismatrix[, c(1,2)]
Add Row and Columns
• Use the cbind() function to add additional columns in a Matrix
• Use the rbind() function to add additional rows in a Matrix
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
• newmatrix <- rbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix
Remove Row and Columns
• Use negative indexing with the c() function to remove rows and columns in a Matrix
• thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow = 3, ncol = 2)
#Remove the first row and the first column
thismatrix <- thismatrix[-c(1), -c(1)]
thismatrix
Working with datasets and files
Reading Dataset:
CSV
amazon <- read.csv("D:/Balu/VIT Bhopal/Data Analysist/amazon.csv")
View(amazon)
TXT
Data <- read_csv("D:/Balu/VIT Bhopal/Data Analysist/Datasets/Data.txt")
View(Data)
XLS
Iris <- read_excel("D:/Balu/VIT Bhopal/Data Analysist/Datasets/Iris.xlsx")
View(Iris)
File operations in R
o Reading from Files
   o Reading CSV Files
   o Reading Text Files
   o Reading Excel Files
o Writing to Files
   o Writing to CSV Files
   o Writing to Text Files
o Appending to Files
o File Existence Checking
o File Deletion
o Working Directory
Reading from Files
• Reading CSV Files: You can use the read.csv() function to read data from a CSV
file.
data <- read.csv("data.csv")
• Reading Text Files: The read.table() function can be used to read data from text
files
data <- read.table("data.txt", header = TRUE)
• Reading Excel Files: The readxl package provides functions to read data from
Excel files.
library(readxl)
data <- read_excel("data.xlsx")
Writing to Files
• Writing to CSV Files: You can use the write.csv() function to write data to a CSV
file
write.csv(data, "output.csv", row.names = FALSE)
• Writing to Text Files: The write.table() function can be used to write data to
text files.
write.table(data, "output.txt", row.names = FALSE)
Appending to Files:
• Appending to Text Files: If you want to append data to an existing text file, you
can use the write.table() function with the append = TRUE argument.
write.table(new_data, "existing_data.txt", append = TRUE,
row.names = FALSE, col.names = FALSE)
File Existence Checking:
• You can check if a file exists using the file.exists()
if (file.exists("data.csv")) {
# Do something
}
File Deletion:
• You can delete a file using the unlink() function
unlink("file_to_delete.txt")
Working Directory:
• You can set the working directory using the setwd() function
setwd("/path/to/directory")
Dataset
• A dataset in the context of data analysis and statistics refers to a structured collection of data
typically organized in a tabular format.
• Rows represent individual observations or instances, and columns represent variables or
attributes describing those observations.
Key components and characteristics of datasets
• Observations/Instances: Each row in a dataset represents a single observation or instance.
This could be a person, an event, a measurement, etc. For example, in a dataset about students,
each row might represent a different student.
• Variables/Attributes: Each column in a dataset represents a variable or attribute associated
with the observations. Variables can be of different types, including numerical (e.g., age, height),
categorical (e.g., gender, education level), or textual (e.g., name, description).
Dataset
• Tabular Structure: Datasets are commonly organized in a tabular structure, similar to a
spreadsheet, with rows and columns. This structure allows for easy manipulation and analysis
using various statistical and computational techniques.
• Data Types: Data within a dataset can be of different types, including integers, floating-point
numbers, strings, dates, etc. It's essential to understand the data types of variables in a dataset,
as they determine the type of analysis and operations that can be performed.
• Missing Values: Datasets may contain missing values, represented as NA (Not Available) or NaN
(Not a Number). Dealing with missing data is an important aspect of data preprocessing and
analysis.
• Metadata: Metadata refers to additional information about the dataset, such as variable names,
data types, units of measurement, etc.
Dataset Extraction
• Dataset extraction refers to the process of obtaining or retrieving a dataset from its source,
which could be a file, a database, an online repository, or any other data storage system.
• The extraction process is a crucial initial step in data analysis and involves accessing the data and
loading it into an appropriate data structure within the analysis environment.
overview of the dataset extraction process
Identify the Data Source: Determine where the dataset is located, whether it's a file on your local
machine, a database server, an online repository, or another source.
Choose the Extraction Method: Select the appropriate method for extracting the dataset based on
its source and format. Common extraction methods include:
Dataset Extraction
• File Import: If the dataset is stored in a file (e.g., CSV, Excel, text file), you can use functions like
read.csv(), read_excel(), or read.table() in R to import the data.
• Database Connection: If the dataset is stored in a database, establish a connection to the
database using appropriate packages like RMySQL, RSQLite, or RPostgreSQL, and execute SQL
queries to extract the data.
• Web Scraping: If the dataset is available on a webpage and there is no direct download option,
you can use web scraping techniques with packages like rvest to extract the data from HTML
pages.
• API Access: Some datasets are accessible through APIs (Application Programming Interfaces).
You can use packages like httr to make HTTP requests and retrieve data from APIs.
• Package Loading: In some cases, datasets are available directly within R packages. You can load such datasets with the data() function after loading the package; a short sketch of this and of the database route above follows.
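A hedged sketch of two of these routes. The first part uses the data() function with the built-in iris dataset; the second assumes the DBI and RSQLite packages are installed and that a hypothetical SQLite file "sales.sqlite" contains a table named "orders":
# Package loading: datasets that ship with R or with an installed package
data("iris")
head(iris)

# Database connection (assumed setup: DBI + RSQLite installed,
# hypothetical file "sales.sqlite" with a table "orders")
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "sales.sqlite")
orders <- dbGetQuery(con, "SELECT * FROM orders")
dbDisconnect(con)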
Dataset Extraction
• Extract the Dataset: Perform the extraction process using the chosen method. This involves
reading the data from the source and loading it into a data structure within R, such as a data
frame or matrix.
• Handle Data Transformation and Cleaning: After extracting the dataset, you may need to
perform data transformation and cleaning operations to prepare the data for analysis. This could
include tasks such as converting data types, handling missing values, filtering out irrelevant
information, and reorganizing the data structure.
• Verify the Extraction: Once the dataset is extracted and prepared, it's essential to verify that the
data has been loaded correctly and accurately represents the original source.
Dataset Preparation
• Dataset preparation in R involves preprocessing and organizing the data to make it suitable for
analysis or modeling.
• This process aims to clean, transform, and structure the dataset in a way that facilitates effective
analysis and interpretation.
overview of the steps involved in dataset preparation in R
• Data Loading: The first step is to load the dataset into R from its source, such as a file.
• Exploratory Data Analysis (EDA): Before performing any preprocessing, it's crucial to
understand the structure and characteristics of the dataset through exploratory data analysis.
This may involve summarizing the dataset, examining summary statistics, visualizing
distributions, identifying outliers, and understanding the relationships between variables.
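As a minimal EDA sketch, the base-R commands below are applied to the built-in iris dataset purely for illustration:
data("iris")

str(iris)       # structure: variable names and types
summary(iris)   # summary statistics for every column
head(iris)      # first few observations

hist(iris$Sepal.Length, main = "Sepal length distribution", xlab = "Sepal length")
boxplot(Sepal.Length ~ Species, data = iris)   # spot outliers and group differences
table(iris$Species)                            # counts of a categorical variable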
Dataset Preparation
• Handling Missing Values: Missing values are common in real-world datasets and need to be
addressed before analysis. Depending on the nature and extent of missingness, you can choose
from various strategies such as imputation (replacing missing values with estimated values),
deletion (removing rows or columns with missing values), or using models that can handle
missing data.
• Data Cleaning: Data cleaning involves identifying and correcting errors, inconsistencies, or
anomalies in the dataset. This may include correcting typos, standardizing variable names,
removing duplicates, and ensuring data consistency across variables.
• Data Transformation: Data transformation involves modifying the structure or values of
variables to better suit the analysis or modeling objectives. Common transformations include
scaling numerical variables, encoding categorical variables, creating new variables through
feature engineering.
Dataset Preparation
• Feature Selection: Feature selection aims to identify the most relevant variables or features that
contribute most to the predictive power of a model.
• Data Splitting: Before modeling, it's essential to split the dataset into training and testing sets to
evaluate the performance of the model. You can use functions like createDataPartition() from the
caret package, or base R's sample(), to split the data (a minimal sketch follows this list).
• Data Integration: If working with multiple datasets, data integration involves combining them
into a single dataset for analysis. This may involve merging datasets based on common variables
or performing other data manipulation operations to integrate the data effectively.
• Data Validation: Finally, it's crucial to validate the prepared dataset to ensure that it meets the
requirements for analysis or modeling. This may involve checking for consistency, accuracy, and
adherence to assumptions.
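The train/test split mentioned above can be sketched as follows; this assumes the caret package is installed, and the 70/30 ratio and the iris data are only examples:
library(caret)
data("iris")

set.seed(123)   # make the split reproducible
train_index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[train_index, ]
test_set  <- iris[-train_index, ]

# The same idea with base R only
idx <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train_base <- iris[idx, ]
test_base  <- iris[-idx, ]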
Data Cleaning
• Data cleaning is a crucial step in the data preparation process that involves identifying and
correcting errors, inconsistencies, and anomalies in the dataset to ensure its accuracy, reliability,
and suitability for analysis or modeling.
Data cleaning tasks and techniques in R:
Handling Missing Values:
• Identifying Missing Values: Use functions like is.na() or complete.cases() to identify missing
values in the dataset.
• Dealing with Missing Values: Decide on an appropriate strategy for handling missing values,
such as imputation (replacing missing values with estimated values), deletion (removing rows or
columns with missing values), or using models that can handle missing data (e.g., imputing
missing values using predictive models).
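A small sketch of these checks on a made-up data frame with missing values:
df <- data.frame(age = c(25, NA, 31, 40), income = c(50000, 62000, NA, 58000))

is.na(df)            # locate missing values
complete.cases(df)   # rows with no missing values
colSums(is.na(df))   # missing values per column

# Deletion: keep only complete rows
df_complete <- df[complete.cases(df), ]

# Simple imputation: replace missing ages with the mean of the observed ages
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)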
Data Cleaning
Removing Duplicates:
• Identifying Duplicates: Use functions like duplicated() to identify duplicate rows in the dataset.
• Removing Duplicates: Use the unique() function or subset the data to remove duplicate rows while
retaining the unique observations.
Standardizing Variable Names:
• Ensure that variable names are consistent and follow a standardized format (e.g., lowercase, no spaces,
underscores instead of spaces).
Handling Outliers:
• Identifying Outliers: Use statistical methods such as box plots, histograms, or z-scores to identify outliers
in numerical variables.
• Dealing with Outliers: Decide on an appropriate strategy for handling outliers, such as removing outliers,
transforming variables, or using robust statistical methods that are less sensitive to outliers.
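A brief sketch of the duplicate and outlier checks above, using made-up values:
# Duplicates
df <- data.frame(id = c(1, 2, 2, 3), score = c(80, 95, 95, 70))
duplicated(df)                       # flags the repeated row
df_unique <- df[!duplicated(df), ]   # or: unique(df)

# Outliers in a small made-up vector (95 is the obvious outlier)
x <- c(10, 12, 11, 13, 95)
z <- (x - mean(x)) / sd(x)   # z-scores; large absolute values suggest outliers
boxplot.stats(x)$out         # values outside the 1.5 * IQR whiskers (here, 95)
boxplot(x)                   # visual check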
Data Cleaning
Data Type Conversion:
• Convert variables to the appropriate data types (e.g., numeric, factor, date) based on their nature and
intended use in the analysis.
Data Validation:
• Validate the data to ensure that it meets the predefined criteria and assumptions. For example, check for
logical inconsistencies or outliers that may indicate errors in data entry or collection.
Iterative Process:
• Data cleaning is often an iterative process that may require multiple rounds of inspection, correction, and
validation to ensure the dataset's quality and integrity.
Data Imputation
• Data imputation is the process of estimating missing or incomplete values in a dataset based on the
available information.
• It's a common technique used in data preprocessing to handle missing data before performing analysis or
modeling tasks.
Mean/Median/Mode Imputation:
• In this simple approach, missing values are replaced with the mean (for numerical variables), median (for
numerical variables with outliers), or mode (for categorical variables) of the observed values in the
respective variable.
• R provides functions like mean() and median() to calculate mean and median values for imputation.
Last Observation Carried Forward (LOCF):
• This method imputes missing values with the last observed value in the time series data.
• R packages like zoo provide functions like na.locf() to perform LOCF imputation.
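A minimal sketch of mean imputation and LOCF; the vectors are made up, and the LOCF part assumes the zoo package is installed:
# Mean imputation
x <- c(4, NA, 7, 10, NA)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x

# Last Observation Carried Forward
library(zoo)
y <- c(2, NA, NA, 5, NA, 8)
na.locf(y)   # each NA is replaced with the most recent observed value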
Data Imputation
Linear Interpolation:
• Linear interpolation estimates missing values by assuming a linear relationship between consecutive
observed values.
• The approx() function in R can be used for linear interpolation.
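A minimal interpolation sketch with base R's approx(), using a made-up series with gaps:
y <- c(10, NA, NA, 22, NA, 30)
idx <- seq_along(y)

filled <- approx(x = idx[!is.na(y)], y = y[!is.na(y)], xout = idx)$y
filled   # NAs replaced by values on the straight line between observed neighbours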
Multiple Imputation:
• Multiple imputation generates multiple possible values for missing data based on the observed data and the
assumed distribution of the missing values.
• R packages like mice (Multivariate Imputation by Chained Equations) implement multiple imputation
methods.
K-Nearest Neighbors (KNN) Imputation:
• KNN imputation estimates missing values by averaging the values of the nearest neighbors in the feature
space. The kNN() function from the VIM package can be used for KNN imputation.
Data Imputation
Regression Imputation:
• Regression imputation models the relationship between the variable with missing values and other
variables in the dataset and uses regression analysis to predict missing values.
• R provides functions like lm() to fit regression models for imputation.
Hot-Deck Imputation:
• Hot-deck imputation replaces missing values with randomly selected observed values from similar cases.
• The hotdeck() function in R (from the VIM package) can be used for hot-deck imputation.
Data Conversion
• Data conversion, in the context of data analysis and manipulation, refers to the process of transforming data
from one format or data type to another.
• This transformation may involve converting between different data types, units of measurement, or data
structures to make the data suitable for analysis, visualization, or modeling.
Data Type Conversion:
• Converting data from one data type to another is a fundamental aspect of data conversion. For example,
converting numeric data stored as character strings to numeric data type, or converting categorical
variables to factors for analysis.
• In R, you can use functions like as.numeric(), as.character(), as.factor(), etc., to perform data type
conversion.
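A few quick examples of type conversion (the values are arbitrary):
as.numeric("42.5")    # character -> numeric
as.character(2024)    # numeric -> character
as.integer("7")       # character -> integer

gender <- c("male", "female", "female")
gender_factor <- as.factor(gender)   # character -> factor for categorical analysis
levels(gender_factor)

as.numeric("abc")     # text that is not a number becomes NA (with a warning)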
Data Conversion
Unit Conversion:
• When dealing with data that involve measurements, it's often necessary to convert between different units
of measurement. For example, converting temperatures from Fahrenheit to Celsius, distances from miles to
kilometers, or currency from one currency unit to another.
• R provides functions for unit conversion, but you may need to specify the conversion factors or use specific
packages depending on the units involved.
Date and Time Conversion:
• Data often include date and time information, which may be stored in various formats. Data conversion may
involve standardizing date and time formats, parsing date and time strings, or converting between different
date and time representations (e.g., from character strings to Date objects).
• R provides functions like as.Date(), as.POSIXct(), strptime(), etc., for date and time conversion.
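A short sketch of date and time conversion; the date strings and formats are only examples:
as.Date("2022-04-02")                         # ISO format is the default
as.Date("02/04/2022", format = "%d/%m/%Y")    # day/month/year text

as.POSIXct("2022-04-02 14:30:00", tz = "UTC") # date-time
strptime("02-Apr-2022 14:30", format = "%d-%b-%Y %H:%M")

format(Sys.Date(), "%d %B %Y")                # back to formatted text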
Data Conversion
String Manipulation:
• Data conversion may involve manipulating and transforming character strings to extract relevant
information or reformat text data. R provides functions like tolower(), toupper(), strsplit(), paste(), etc.,
for string manipulation.
Data Structure Conversion:
• This could involve converting data between wide and long formats, pivoting or melting data, or reshaping
data frames.
• R provides functions like reshape(), melt() (from the reshape2 package), pivot_longer(), pivot_wider()
(from the tidyr package), etc., for data structure conversion.
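A minimal wide-to-long reshaping sketch, assuming the tidyr package is installed; the scores data frame is a made-up example:
library(tidyr)

scores_wide <- data.frame(student = c("A", "B"),
                          math = c(90, 75),
                          science = c(85, 80))

scores_long <- pivot_longer(scores_wide,
                            cols = c(math, science),
                            names_to = "subject",
                            values_to = "score")
scores_long

pivot_wider(scores_long, names_from = subject, values_from = score)   # back to wide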
Encoding Conversion:
• This is particularly important when working with text data that may be encoded in different character sets
or encodings. R provides functions like iconv() for character encoding conversion.