Arunav Da Prac

The document is a practical file for a Data Analytics course using R, detailing various assignments and exercises related to R programming. It covers topics such as installing R and RStudio, creating and manipulating data structures, applying functions, and interpreting outputs from datasets like iris and Boston. The file serves as a comprehensive guide for students to learn and apply data analytics techniques using R.

Uploaded by

Arunav Pathak

Practical File

Of
Data Analytics Using R

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

GURU JAMBHESHWAR UNIVERSITY OF SCIENCE AND TECHNOLOGY
Hisar – Haryana (India)

Submitted To: Mr. Davinder Sir, Department of CSE
Submitted By: Ansh Kinha, Roll Number: 230010150020, Class: B.Tech (CSE-AI & ML)

INDEX

Sr. No.   Program

1. Install R and then install RStudio. Get yourself acquainted with the GUI of various working windows of RStudio.
2. Create the following objects in R and then check their class:
(a). A vector of strings
(b). A vector consisting of factor type data. For instance, a vector consisting of hair color of a few individuals.
(c). A list data type consisting of vectors of names of five students and a matrix of the marks of students in four courses.
(d). A data frame consisting of names of students, their age, total marks, and grades awarded.
3. Apply str and summary commands to the object created in assignment 2. Interpret the output.
4. Check and justify the outcome of the following expressions:
(a). sqrt(3)^2 == 3
(b). near(sqrt(3)^2,3)
5. Install and load the package 'stringdist' and run the following code:
my_string = c("Viraj","Viraj","Viraj","Vikraj","Viraji","Viroj","Vroj","Siroji")
name = "Viraj"
matched = (stringdist(my_string, name) == 0)
matched = (stringdist(my_string, name) == 1)
matched = (stringdist(my_string, name) == 2)
Interpret the output. [Hint: 'stringdist' is a package to find the distance between strings in terms of replacement, insertion and deletion of letters.]
6. Apply summary command to iris dataset of the 'datasets' package and interpret the output.
7. Use plot(iris) function and interpret the output. Write down your findings about the dataset.
8. Install and load the MASS package and access the Boston dataset. Study the dataset from the resources available on the internet and write what you can find relevant to the dataset.
9. Write a script file to compute the following of the numeric variable in Boston dataset.
(a). Sum
(b). Range
(c). Mean
(d). Standard deviation
10. Create a vector x of all those values from 1:100 that are divisible by 5 and do the following operations on the vector:
(a). Find the length of vector x.
(b). Print the values stored at the fifth, tenth, and fifteenth location of vector x.
(c). Find the sum, mean, range, median and standard deviation of vector x.
(d). Replace the fifth and tenth values with NA and NaN values, respectively, and find the mean of the modified vector.
(e). Check if x contains any NA values and print the indices of NA values in vector x.
(f). Remove NA values from vector x and use summary command on it.
(g). Print the values of first and third quartile of vector x from the output of the summary command.
11. Assume the given vectors and do the following operations:
x=1:12; y=13:24; z=1:6; a=1:12;
b=(13,15,17,19,20,21,23,25,27,29,31,24);
c=(5,10,15,NA,25,NaN);
v=(26,21,87,56,72,60); k=(0,2,4,6,8,16,32);
(a). Find x × y and x × y × z and interpret the output.
(b). Do an element-wise comparison between x and a, and y and b.
(c). Find all the elements that are greater than 6 of vector x and store these elements into another vector p.
(d). Check for NA and NaN values in vectors b and c.
(e). Check if overall vector x is equal to vector a and vector b.
(f). Why does identical(x, z) evaluate to FALSE?
(g). What is the difference between all() and all.equal() functions? Illustrate with the help of an example.
(h). Run any(x, z) function and interpret the output.
(i). Create a new vector of the non-NA values of vector c using a single line code.
(j). Sort vector v in descending order and output the original indices in order of the sorted elements. Find log to the base 2 of vector k.
12. Assuming the character vector cv = c("sunita", "bimla", "kavita", "geeta", "anu", "dikshita", "susmita", "seema"):
(a). Find the character count in each name.
(b). Check whether geeta exists in vector cv.
13. Output the indices of the names that contain substring ee in vector cv of ques: 12.
14. Find out how many strings end with the letters ta in vector cv of ques: 12.
15. Create a vector of factor data type for the hair colors of ten people where values for hair colors are black, darkbrown, grey, and blond.
(a). Display the levels of factor data.
(b). Find the modal value in the vector of hair colors.
16. Apply class, str, and summary commands to the vector created in assignment 15.
17. Create an empty vector of factor data type for the names of the first six months in a year. Remember to keep the levels of the data in order of the months, from January to June.
18. Create a vector to store the grades of 20 students for the first minor exam. Grades are given at four levels (A,B,C,D). Compute the modal grade of same students for the second minor exam. Count the number of students who have got a higher grade in the second minor.
19. Create a matrix m of five rows in a row-major order of numbers from 1 to 100 incremented by a step of 5 units:
(a). Find row and column-wise means of matrix m.
(b). Find the minimum value for each row and column.
(c). Find the transpose and sort the values in each column in decreasing order.
(d). Assign the row names as R1 to R5 and column names C1 to C4.
(e). Display all the elements of the second and fourth column without using indices.
(f). Display all the elements of the first and third row without using indices.
(g). Create a new matrix by deleting the second and fourth column of matrix m using indices and column names.
(h). Replace elements at indices (2,3), (2,4), (3,3), and (3,4) with NA values.
(i). Replace elements at index (1,3) with NaN.
(j). Check if matrix m contains any NA or NaN values and interpret the output.
(k). Create two new matrices rm and cm by concatenating matrix m row-wise and column-wise with itself.
20. Interpret the output of the following commands:
(a). n=matrix(rep(m,2), nrow=ncol(m), byrow=FALSE)
(b). n=matrix(rep(m,2), nrow=nrow(m), byrow=FALSE)
(c). n=matrix(rep(m,2), nrow=ncol(m), byrow=TRUE)
(d). m1=do.call(rbind, replicate(2,m,simplify=FALSE))
(e). m2=do.call(cbind, replicate(2,m,simplify=FALSE))
(f). Rename row and column names as per the requirements of matrix m1 and m2.
21. Create a 4*3 matrix A of uniformly distributed random integer numbers between 1 to 100. Create another 3*4 matrix B with uniformly distributed random integer numbers between 1 to 10. Perform matrix multiplication of the two matrices and store the result in a third matrix C.
22. Replicate the resulting matrix C obtained in ques: 21 twice vertically.
23. Create A and B, two 4*3 matrices of normally distributed random numbers, with mean 0 and standard deviation 1. Find the indices of all those numbers in matrix A which are less than the respective numbers in matrix B and print these numbers.
24. Plotting pressure dataset in different forms:
(a). Histogram
(b). Boxplot

1. Install R and then install RStudio. Get yourself acquainted with
the GUI of various working windows of RStudio.

Steps to install R and RStudio:

i. Install R. Download the R installer from https://cran.r-project.org/
ii. Install RStudio. Download RStudio from https://www.rstudio.com/products/rstudio/download/
iii. Check that R and RStudio are working. Open RStudio. ...
iv. Install R packages required for the workshop.
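
Step iv can be checked from the R console. The sketch below assumes that the required packages are the ones loaded in the later practicals of this file (dplyr, stringdist and MASS). It only reports which of them are missing; the install.packages() call is left as a comment because it needs an internet connection.

```r
# Packages assumed to be needed, based on the library() calls
# that appear in the later code sections of this file.
required <- c("dplyr", "stringdist", "MASS")

# requireNamespace() returns TRUE when a package is installed and loadable.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)   # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```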

2. Create the following objects in R and then check their class
a) A vector of strings
b) A vector consisting of factor type data. For instance, vector
consisting of hair color of a few individuals.
c) A list data type consisting of vectors of names of five students
and a matrix of the marks of students in four courses.
d) A data frame consisting of names of students, their age, total
marks, and grades awarded.

Code:
## (a) a vector of strings
s = c("we","are","good","friends")
class(s)

## (b) a factor vector of hair colors
hair_color = factor(c("black","blue","brown","black","blue","black","black","brown"))
class(hair_color)

## (c) a list holding five student names and a 5 x 4 matrix of marks
## (one row per student, one column per course)
x = c("hiteshi","jyoti","taruna","pragani","annu")
y = matrix(c(90,80,70,80,90,80,70,60,50,40,50,60,70,80,80,90,80,70,50,40),nrow=5,ncol=4)
l = list(x,y)
l
class(l)

## (d) a data frame of names, age, total marks and grades
name = c("hiteshi","jyoti","taruna","pragani","annu")
age = c(20,20,20,19,20)
marks = c(90,80,70,60,50)
grades = c("A","B","C","D","E")
data = data.frame(name , age , marks , grades)
data
class(data)

Output:

3. Apply str and summary commands to the object created in
assignment 2. Interpret the output.

Code:

## str and summary of the character vector
s = c("we","are","good","friends")
str(s)
summary(s)

## str and summary of the factor
hair_color = factor(c("black","blue","brown","black","blue","black","black","brown"))
str(hair_color)
summary(hair_color)

## str and summary of the list
x = c("hiteshi","jyoti","taruna","pragani","annu")
y = matrix(c(90,80,70,80,90,80,70,60,50,40,50,60,70,80,80,90,80,70,50,40),nrow=5,ncol=4)
l = list(x,y)
str(l)
summary(l)

## str and summary of the data frame
name = c("hiteshi","jyoti","taruna","pragani","annu")
age = c(20,20,20,19,20)
marks = c(90,80,70,60,50)
grades = c("A","B","C","D","E")
data = data.frame(name , age , marks , grades)
data
str(data)
summary(data)

Output:

INTERPRETATION:

• The str function compactly displays the internal structure of an
R object: its type, its length or dimensions, and the first few
values of each component.
• The summary function gives an overview whose content depends
on the class of the object. For numeric data it reports the
minimum, first quartile, median, mean, third quartile and
maximum; for a factor it reports the count of each level; for
character data it reports only the length, class and mode.
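
The difference between the two commands can be seen on small objects. This is a minimal sketch using hypothetical sample values (marks, grade and names_chr are illustrative names, not objects from the assignment).

```r
# Hypothetical sample data, not taken from the practical file.
marks <- c(90, 80, 70, 60, 50)
grade <- factor(c("A", "B", "B", "C", "C"))
names_chr <- c("p", "q", "r", "s", "t")

# str() shows the type, the length and the first values.
str(marks)
str(grade)

# summary() adapts to the class of its argument.
num_summary <- summary(marks)      # min, quartiles, median, mean, max
fac_summary <- summary(grade)      # number of observations per level
chr_summary <- summary(names_chr)  # Length, Class and Mode only

print(num_summary)
print(fac_summary)
print(chr_summary)
```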
4. Check and justify the outcome of the following expressions:
a) sqrt(3)^2 == 3
b) near(sqrt(3)^2,3)

Code:

library("dplyr")
#Part-A
sqrt(3)^2==3
#Part-B
near(sqrt(3)^2,3)

Output:
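
The justification rests on floating-point rounding: sqrt(3) cannot be stored exactly, so squaring it gives a value slightly different from 3 and == returns FALSE, while near() compares within a small tolerance. The sketch below reproduces both results in base R only; the tolerance sqrt(.Machine$double.eps) is assumed here to match the default used by near().

```r
a <- sqrt(3)^2

# Exact comparison fails because of rounding in sqrt(3).
exact <- (a == 3)

# Comparison within a tolerance succeeds.
tol <- sqrt(.Machine$double.eps)
close_enough <- abs(a - 3) < tol

# Base R offers the same idea through all.equal().
base_check <- isTRUE(all.equal(a, 3))

print(format(a, digits = 17))   # shows the stored value of sqrt(3)^2
print(c(exact = exact, close_enough = close_enough, base_check = base_check))
```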

5. Install and load the package 'stringdist' and run the following code:
my_string = c("Viraj","Viraj","Viraj","Vikraj","Viraji","Viroj","Vroj","Siroji")
name = "Viraj"
matched = (stringdist(my_string, name) == 0)
matched = (stringdist(my_string, name) == 1)
matched = (stringdist(my_string, name) == 2)
Interpret the output.
[Hint: 'stringdist' is a package to find the distance between strings
in terms of replacement, insertion and deletion of letters.]

Code:

library("stringdist")
my_strings=c("viraj","virat","vikraj","viraji","viroj","vroj","siroji")
name="viraj"
matched=(stringdist(my_strings,name)==0)
matched
matched=(stringdist(my_strings,name)==1)
matched
matched=(stringdist(my_strings,name)==2)
matched

Output:

INTERPRETATION:

• stringdist() returns the distance between each element of
my_strings and name, that is, the number of single-letter
insertions, deletions or replacements needed to turn one
string into the other. Comparing the distance with 0, 1 or 2
gives a logical vector.
• For ==0, only the exact match "viraj" is TRUE.
• For ==1, TRUE marks the strings that differ from "viraj" by
exactly one letter, whether inserted, deleted or replaced:
"virat", "vikraj", "viraji" and "viroj".
• For ==2, TRUE marks the strings that need exactly two such
changes: "vroj". "siroji" needs three, so it is FALSE in all
three comparisons.
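
The distances behind these TRUE/FALSE vectors can be cross-checked without the stringdist package. The sketch below swaps in base R's adist(), which computes the Levenshtein distance (insertions, deletions and replacements); it is a stand-in for illustration, not the stringdist() call itself.

```r
my_strings <- c("viraj", "virat", "vikraj", "viraji", "viroj", "vroj", "siroji")
name <- "viraj"

# adist() returns a matrix; flatten it to one distance per string.
d <- as.vector(adist(my_strings, name))
names(d) <- my_strings
print(d)

# Strings at each distance, mirroring the ==0, ==1 and ==2 comparisons.
print(my_strings[d == 0])
print(my_strings[d == 1])
print(my_strings[d == 2])
```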

6. Apply summary command to iris dataset of the 'datasets' package
and interpret the output.

Code:

library("datasets")
iris
summary(iris)

Output:

INTERPRETATION:

• Applying the summary command to the iris dataset gives, for
each of the four numeric columns (Sepal.Length, Sepal.Width,
Petal.Length and Petal.Width), the minimum, first quartile,
median, mean, third quartile and maximum.
• Species is a factor, so summary reports the number of
observations at each level instead: 50 each of setosa,
versicolor and virginica.
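
The individual pieces of that summary can also be computed directly, which makes the interpretation easier to verify. A minimal sketch on the built-in iris data (num_means and species_counts are illustrative names):

```r
# Mean of each numeric column (columns 1 to 4).
num_means <- sapply(iris[, 1:4], mean)
print(round(num_means, 2))

# Number of observations per species.
species_counts <- table(iris$Species)
print(species_counts)

# Six-number summary of a single column.
print(summary(iris$Petal.Length))
```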

7. Use plot(iris) function and interpret the output. Write down your
findings about the dataset.

Code:

plot(iris)

Output:

INTERPRETATION:

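• Called on a data frame, the plot function draws a scatterplot
matrix: one scatter plot for every pair of columns.
• For iris, the panels show that Petal.Length and Petal.Width rise
together almost linearly, while the panels involving Sepal.Width
are much more scattered. The petal measurements also split the
observations into a clearly separated small-petal group and a
larger overlapping group; the Species panels identify these as
setosa on one side and versicolor and virginica on the other.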
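
What the scatterplot matrix shows visually can be backed by numbers. The sketch below computes the pairwise correlations and the petal-length range per species; it is a supplementary check, not part of the assigned plot(iris) call.

```r
# Pairwise correlations between the four numeric columns.
cors <- cor(iris[, 1:4])
print(round(cors, 2))
petal_cor <- cors["Petal.Length", "Petal.Width"]

# Range of petal length within each species.
petal_range <- tapply(iris$Petal.Length, iris$Species, range)
print(petal_range)

# Largest setosa petal versus smallest petal of the other species.
setosa_max <- max(iris$Petal.Length[iris$Species == "setosa"])
others_min <- min(iris$Petal.Length[iris$Species != "setosa"])
print(c(setosa_max = setosa_max, others_min = others_min))
```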

8. Install and load the MASS package and access the Boston dataset.
Study the dataset from the resources available on the internet and
write what you can find relevant to the Dataset.

Code:

library("MASS")
Boston

Output:
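
A few commands give a first description of the dataset before reading about it elsewhere. Boston is a data frame of housing values in the suburbs of Boston with 506 rows and 14 columns; medv, the median value of owner-occupied homes in $1000s, is usually treated as the response variable. The sketch assumes the MASS package is installed.

```r
library(MASS)

dims <- dim(Boston)          # number of rows and columns
col_names <- names(Boston)   # variable names

print(dims)
print(col_names)
str(Boston)                  # type and first values of every column

# help("Boston") opens the package documentation with the meaning
# of each variable.
```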

9. Write a script file to compute the following of the numeric variable
in Boston dataset.
a) Sum
b) Range
c) Mean
d) Standard deviation

Code:

library(MASS)
## each statistic is computed column by column
sapply(Boston,sum)
sapply(Boston,range)
sapply(Boston,mean)
sapply(Boston,sd)

Output:

10. Create a vector x of all those values from 1:100 that are
divisible by 5 and do the following operations on the vector:
a) Find the length of vector x.
b) Print the values stored at the fifth, tenth, and fifteenth
location of vector x.
c) Find the sum mean range median and standard deviation of
vector x.
d) Replace the fifth and tenth values with NA and NaN values,
respectively and find the mean of modified vector.
e) Check if x contains any NA values and print the indices of
NA values in vector x.
f) Remove NA values from vector x and use summary
command on it.
g) Print the values of first and third quartile of vector x from the
output of the summary command.

Code:

x=c(5*1:20)
x
#part-A
length(x)
#part-B
x[c(5,10,15)]
#part-C
sum(x)
mean(x)
range(x)
median(x)
sd(x)
#part-D
x[5]=NA
x[10]=NaN
x
y=mean(x)            # the missing values propagate into the result
y
z=mean(x,na.rm=TRUE)
z
#part-E
anyNA(x)
a=which(is.na(x))    # is.na() is TRUE for both NA and NaN
a
#part-F
x=x[!is.na(x)]
x
summary(x)
#part-G
b=summary(x)
b["1st Qu."]
b["3rd Qu."]

Output:

11. Assume the given vectors and do the following operations:
x=1:12; y=13:24; z=1:6; a=1:12;
b=(13,15,17,19,20,21,23,25,27,29,31,24); c=(5,10,15,NA,25,NaN);
v=(26,21,87,56,72,60); k=(0,2,4,6,8,16,32);
(a). Find x × y and x × y × z and interpret the output.
(b). Do an element-wise comparison between x and a, and y and b.
(c). Find all the elements that are greater than 6 of vector x and store these
elements into another vector p.
(d). Check for NA and NaN values in vectors b and c.
(e). Check if overall vector x is equal to vector a and vector b.
(f). Why does identical(x, z) evaluate to FALSE?
(g). What is the difference between all() and all.equal() functions?
Illustrate with the help of an example.
(h). Run any(x, z) function and interpret the output.
(i). Create a new vector of the non-NA values of vector c using a single
line code.
(j). Sort vector v in descending order and output the original indices in
order of the sorted elements. Find log to the base 2 of vector k.

Code:

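x=1:12
y=13:24
z=1:6
a=1:12
b=c(13,15,17,19,20,21,23,25,27,29,31,24)
c=c(5,10,15,NA,25,NaN)
v=c(26,21,87,56,72,60)
k=c(0,2,4,6,8,16,32)
#Part-A
x*y
x*y*z                # z is recycled to length 12
#Part-B
x==a
y==b
#Part-C
p=x[x>6]
p
#Part-D
is.na(b)
is.na(c)             # TRUE for both NA and NaN
is.nan(c)            # TRUE for NaN only
#Part-E
all(x==a)
all(x==b)
#Part-F
identical(x,z)       # FALSE: the two vectors differ in length
#Part-G
r=seq(0,1,by=0.2)
r
s=c(0.0,0.2,0.4,0.6,0.8,1.0)
all(r==s)            # exact element-wise comparison
all.equal(r,s)       # comparison within a small tolerance
#Part-H
any(x==z)
#Part-I
p=c[!is.na(c)]
p
#Part-J
sort(v,decreasing = TRUE)
order(v,decreasing = TRUE)   # original indices of the sorted elements
log2(k)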
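
Parts (a), (f) and (g) rest on three ideas: recycling of the shorter vector, strict identity, and exact versus tolerant comparison. The sketch below isolates each one; the values 0.1, 0.2 and 0.3 are a standard illustration chosen here, not data from the assignment.

```r
x <- 1:12
z <- 1:6

# Recycling: z is repeated until it matches the length of x.
recycled <- x * z
same_as_rep <- all(recycled == x * rep(z, times = 2))

# identical() requires the same type, length and values.
same_object <- identical(x, z)

# all() with == tests exact equality; all.equal() allows a small tolerance.
exact <- all(0.1 + 0.2 == 0.3)
approx <- isTRUE(all.equal(0.1 + 0.2, 0.3))

print(c(same_as_rep = same_as_rep, same_object = same_object,
        exact = exact, approx = approx))
```

Because all.equal() returns a description of the difference rather than FALSE when the values differ, it is usually wrapped in isTRUE() inside conditions.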

Output:

12. Assuming the character vector cv = c("sunita", "bimla", "kavita",
"geeta", "anu", "dikshita", "susmita", "seema"):
(a). Find the character count in each name.
(b). Check whether geeta exists in vector cv.

Code:

cv=c("sunita","bimla","kavita","geeta","anu","dikshita","sushmita","seema")
#part-A
nchar(cv)
#part-B
'geeta'%in%cv

13. Output the indices of the names that contain substring ee in vector
cv of ques: 12.

Code:

cv=c("sunita","bimla","kavita","geeta","anu","dikshita","sushmita","seema")
which(grepl('ee',cv))

Output:

14. Find out how many strings end with the letters ta in vector cv of
ques: 12.

Code:

cv=c("sunita","bimla","kavita","geeta","anu","dikshita","sushmita","seema")
endsWith(cv,'ta')
sum(endsWith(cv,'ta'))   # number of strings that end with "ta"

Output:

15. Create a vector to store the grades of 20 students for the first minor
exam. Grades are given at four levels (A,B,C,D). Compute the modal
grade of same students for the second minor exam. Count the number
of students who have got a higher grade in the second minor.

Code:

minor1=factor(c("A","B","C","D","C","B","C","D","B","B","C","D","B","B","B","D"),
levels=c("A","B","C","D"),ordered=TRUE)
which.max(table(minor1))

minor2=factor(c("D","B","C","D","C","C","C","D","D","B","C","D","B","B","A","D"),
levels=c("A","B","C","D"),ordered=TRUE)
which.max(table(minor2))   # modal grade of the second minor
minor1==minor2
# levels are ordered A < B < C < D, so minor1 > minor2 marks the
# students whose second-minor grade is closer to A
sum(minor1>minor2)
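
The comparison logic is easier to follow on a handful of values. This is a minimal sketch with hypothetical grades for five students (first and second are illustrative names); it assumes, as the code above does, that A is the best grade.

```r
# Hypothetical grades for five students, not the data from the assignment.
lv <- c("A", "B", "C", "D")
first  <- factor(c("B", "C", "A", "D", "B"), levels = lv, ordered = TRUE)
second <- factor(c("A", "C", "B", "B", "B"), levels = lv, ordered = TRUE)

# Modal grade: the level with the highest count.
modal_second <- names(which.max(table(second)))

# With levels ordered A < B < C < D, a better grade is a smaller level.
improved <- sum(second < first)

print(modal_second)
print(improved)
```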

Output:

16. Create a matrix m of five rows in a row-major order of numbers
from 1 to 100 incremented by a step of 5 units:
(a). Find row and columns-wise means of matrix m.
(b). Find the minimum value for each row and column.
(c). Find the transpose and sort the values in each column in
decreasing order.
(d). Assign the row names as R1 to R5 and column names C1 to C4.
(e). Display all the elements of the second and fourth column without
using indices.
(f). Display all the elements of the first and third row without using
indices.
(g). Create a new matrix by deleting the second and fourth column of
matrix m using indices and column names.
(h). Replace elements at indices (2,3), (2,4), (3,3), and (3,4) with NA
values.
(i). Replace elements at index (1,3) with NaN.
(j). Check if matrix m contains any NA or NaN values and interpret the
output.
(k). Create two new matrices rm and cm by concatenating matrix m
row-wise and column-wise with itself.

Code:

m=matrix(seq(1,100,5),nrow=5,byrow=TRUE)
m
#Part-A
rowMeans(m)
colMeans(m)
#Part-B
apply(m,MARGIN=1,min)
apply(m,MARGIN=2,min)
#Part-C
t(m)
apply(t(m), MARGIN=2, function(x) sort(x,decreasing= TRUE))
#Part-D
rownames(m)=c("R1","R2","R3","R4","R5")
colnames(m)=c("C1","C2","C3","C4")
m
#Part-E
m[,c("C2","C4")]
#Part-F
m[c("R1","R3"),]
#Part-G
m1=m[,-c(2,4)]                          # by indices
m1
m[,!(colnames(m) %in% c("C2","C4"))]    # by column names
#Part-H
m[(2:3),(3:4)]=NA
m
#Part-I
m[1,3]=NaN
m
#Part-J
is.na(m)      # TRUE for both NA and NaN
is.nan(m)     # TRUE for NaN only
#Part-K
rm=rbind(m,m) # m stacked on top of itself
rm
cm=cbind(m,m) # m placed side by side with itself
cm

Output:

17. Interpret the output of the following commands:
(a). n=matrix(rep(m,2), nrow=ncol(m), byrow=FALSE)
(b). n=matrix(rep(m,2), nrow=nrow(m), byrow=FALSE)
(c). n=matrix(rep(m,2), nrow=ncol(m), byrow=TRUE)
(d). m1=do.call(rbind, replicate(2,m,simplify=FALSE))
(e). m2=do.call(cbind, replicate(2,m,simplify=FALSE))
(f). Rename row and column names as per the requirements of matrix
m1 and m2.

Code:

#Part-A
n=matrix(rep(m,2), nrow=ncol(m), byrow=FALSE)
n
#Part-B
n=matrix(rep(m,2), nrow=nrow(m), byrow=FALSE)
n
#Part-C
n=matrix(rep(m,2), nrow=ncol(m), byrow=TRUE)
n
#Part-D
m1=do.call(rbind, replicate(2,m,simplify=FALSE))
m1
#Part-E
m2=do.call(cbind, replicate(2,m,simplify=FALSE))
m2
#part-F
rownames(m1)=c("R1","R2","R3","R4","R5","R6","R7","R8","R9","R10")
colnames(m2)=c("C1","C2","C3","C4","C5","C6","C7","C8")
m1
m2
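
The commands differ in how the 2 copies of m are laid out. rep(m,2) flattens m column by column and repeats it, and matrix() then refills those values according to nrow and byrow, so only some combinations reproduce m. A minimal sketch on a hypothetical 2 x 3 matrix (smaller than the 5 x 4 matrix m of the assignment) makes the pattern visible:

```r
# Hypothetical small matrix used only for illustration.
m <- matrix(1:6, nrow = 2)

# Same number of rows, filled by column: two copies of m side by side.
side_by_side <- matrix(rep(m, 2), nrow = nrow(m), byrow = FALSE)

# nrow = ncol(m) reshapes the values, so the result is not a copy of m.
reshaped <- matrix(rep(m, 2), nrow = ncol(m), byrow = FALSE)

# rbind/cbind over a list of copies stacks m vertically or horizontally.
stacked <- do.call(rbind, replicate(2, m, simplify = FALSE))
widened <- do.call(cbind, replicate(2, m, simplify = FALSE))

print(side_by_side)
print(reshaped)
print(stacked)
print(widened)
```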

Output:

18. Create a 4*3 matrix A of normally distributed random numbers with
mean 100 and standard deviation 10. Create another 3*4 matrix B of
normally distributed random numbers with mean 10 and standard
deviation 1. Perform matrix multiplication of the two matrices and store
the result in a third matrix C, rounded to two decimal places.

Code:

# Set seed for reproducibility


set.seed(123)

# Create matrix A (4x3) with mean 100 and standard deviation 10


A <- matrix(rnorm(12, mean = 100, sd = 10), nrow = 4, ncol = 3)

# Create matrix B (3x4) with mean 10 and standard deviation 1


B <- matrix(rnorm(12, mean = 10, sd = 1), nrow = 3, ncol = 4)

# Perform matrix multiplication of A and B to get matrix C


C <- A %*% B

# Round the result to two decimal places


C_rounded <- round(C, 2)

# Print the matrices and the result


cat("Matrix A:\n")
print(A)

cat("\nMatrix B:\n")
print(B)

cat("\nMatrix C (Result of A * B) rounded to two decimal places:\n")


print(C_rounded)

Output:

19. Create a 4*3 matrix A of uniformly distributed random integer
numbers between 1 to 100. Create another 3*4 matrix B with uniformly
distributed random integer numbers between 1 to 10. Perform matrix
multiplication of the two matrices and store the result in a third matrix C.

Code:

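## sample() draws uniformly distributed random integers;
## runif() would return non-integer values
a=matrix(sample(1:100,12,replace=TRUE),nrow=4)
a
b=matrix(sample(1:10,12,replace=TRUE),nrow=3)
b
c=a%*%b
c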

Output:

20. Replicate the resulting matrix C obtained in ques: 19 twice vertically.

Code:

c1=do.call(rbind,replicate(2,c,simplify = FALSE))
c1

Output:

21. Create A and B, two 4*3 matrices of normally distributed random
numbers, with mean 0 and standard deviation 1. Find the indices of all
those numbers in matrix A which are less than the respective numbers
in matrix B and print these numbers.

Code:

A=matrix(rnorm(12),nrow=4)
A
B=matrix(rnorm(12),nrow=4)
B
K=which(A<B,arr.ind=TRUE)   # row and column index of each match
K
A[K]

Output:

24. Plotting pressure dataset in different forms:

Code:

library("datasets")
pressure
plot(pressure)

Output:

(a).Histogram

Code:

hist(pressure$temperature,main="frequency distribution of temperature variable",
xlab="temperature (in Celsius)",ylab="frequency",border="black",
col=c("violet","darkred","blue","green","yellow","orange","red","white"))
box("figure")

Output:

Code:

hist(pressure$pressure,main="probability distribution of pressure variable",
xlab="pressure (in mm Hg)",breaks=10,freq=FALSE,border="black",
col=c("violet","darkred","blue","green","yellow","orange","red","white","brown"))
box("figure")

Output:

(b).Boxplot

Code:

#Kunal Malik
#210010150018
#CSE-AI&ML

boxplot(pressure,main="box plots of variables of the pressure dataset",
names=c("Temperature (Celsius)","Pressure (mm Hg)"),border="black",col=c("blue","red"))

Output:
