EX.NO: 1 DPLYR PACKAGE
DATE:
Do the data manipulation operations for the iris and mtcars datasets using the dplyr package and obtain
the results for the following functions:
i) filter
ii) select
iii) arrange
iv) mutate
v) summarise
AIM:
To do the data manipulation operations for the iris and mtcars datasets using the dplyr package and obtain
the results for the following functions.
Procedure and code:
i) Filter:
To install dplyr, use the below command
install.packages("dplyr")
To load dplyr, use the below command
library(dplyr)
Loading a data set
data("mtcars")
data("iris")
mydata <- mtcars
mydata
Creating a local data frame (tibble). Local data frames are easier to read. Note that tbl_df() is deprecated
in recent versions of dplyr, where as_tibble() is the recommended replacement.
mynewdata <- tbl_df(mydata)
mynewdata
myirisdata <- tbl_df(iris)
myirisdata
Use filter() to filter data with the required condition:
filter(mynewdata, cyl > 4 & gear > 4)
filter(mynewdata, cyl > 4)
filter(myirisdata, Species %in% c('setosa', 'virginica'))
ii) select:
When you are working with large datasets that have many columns but you are interested in only a few,
select() allows you to rapidly zoom in on a useful subset using operations that usually only work
on numeric variable positions.
select(mynewdata, cyl, mpg, hp)
Hide a range of columns:
select(mynewdata, -c(mpg, cyl, disp))
iii) arrange:
arrange() reorders the rows by the given columns; wrapping a column in desc() sorts it in descending order.
mynewdata %>%
  select(cyl, wt, gear) %>%
  arrange(desc(wt))
iv) mutate:
The mutate() function adds new variables while preserving the existing ones: it creates new columns that
are functions of existing columns.
mynewdata %>%
  select(mpg, cyl) %>%
  mutate(newvariable = mpg * cyl)
v) summarise:
The summarise() function collapses a data frame to a single row (or to one row per group when the data
are grouped with group_by()).
myirisdata %>%
  group_by(Species) %>%
  summarise(average = mean(Sepal.Length, na.rm = TRUE))
RESULT:
The dplyr package program has been executed successfully.
EX.NO:2 TIDYR PACKAGE
DATE:
Create a data frame and do the following operations using the tidyr package:
i) gather
ii) spread
iii) separate
iv) unite
AIM:
To create a data frame and do the following operations using the tidyr package.
Procedure and Code:
Installing and loading the tidyr package:
install.packages('tidyr')
library(tidyr)
Creating a dummy data set.
name <- c('Akanash', 'Bhanu','Vinay', 'Varun', 'Prashanth')
weight <- c(35,45,55,65,75)
age <- c(20,21,22,23,24)
class <- c('maths','physics','chemistry','biology','science')
Create a data frame
tdata <- data.frame( name, weight, age, class)
tdata
i) gather():
gather() takes multiple columns and converts them into key: value pairs, transforming data from wide form
to long form. It can be used as an alternative to melt() in the reshape package. (In recent versions of
tidyr, gather() is superseded by pivot_longer().)
longt <- tdata %>% gather(key, value, weight:class)
longt
ii) spread():
spread() does the reverse of gather(): it accepts a key: value pair and spreads it back into separate
columns. (In recent tidyr it is superseded by pivot_wider().)
wide <- longt %>% spread(key, value)
wide
iii) separate():
separate() splits a single column into multiple columns. It is useful when a column holds several pieces
of information at once, such as a date-time variable; splitting it lets you use those values individually.
The following code shows the usage of the separate function.
Create a data frame:
Humidity <- c(37.79,42.34,52.16,44.57,48.83,44.59)
Rain <- c(0.971360441,1.1096716,1.06475853,0.953183435,0.98878849,0.9887643)
Time <- c("13/03/2018 23:24","09/01/2019 15.44","25/12/2018 19:15","02/01/2019 07:46","14/03/2018 01:55","20/10/2018 20:52")
dset <- data.frame(Humidity, Rain, Time)
dset
Using the separate function we can split Time into date, month and year (by default separate() splits on
any non-alphanumeric character, so the extra time pieces are dropped with a warning).
separate_d <- dset %>% separate(Time, c('Date', 'Month', 'Year'))
separate_d
iv) unite():
unite() does the reverse of separate(): it combines multiple columns into a single column.
unite_d <- separate_d %>% unite(Time, c(Date, Month, Year), sep = "/")
unite_d
RESULT:
The tidyr package program has been executed successfully.
EX.NO: 3 DATA.TABLE PACKAGE
DATE:
Do the following operations for a dataset using the data.table package:
i) Select a subset of rows
ii) Select a column with particular values
iii) Select columns with multiple values
AIM:
To do the data manipulation operations for the built-in airquality and iris datasets using the data.table package.
Procedure and Code:
Loading air quality data:
data("airquality")
mydata <- airquality
mydata
Loading iris data:
data("iris")
myiris <- iris
myiris
Converting into table format:
install.packages("data.table")
library(data.table)
myirisdata <-data.table(myiris)
myirisdata
i) Select a subset of rows:
mydata[2:4, ]
ii) Select a column with particular values
myirisdata[Species == 'setosa']
iii) Select columns with multiple values
myirisdata[Species %in% c('setosa','virginica')]
RESULT:
Thus, the data manipulation using the data.table package has been executed successfully.
EX.NO: 4 GGPLOT2 PACKAGE
DATE:
AIM:
To do the different types of visualization for built-in datasets (iris, mpg, mtcars) using the ggplot2
package in R.
a. Line graph
b. Bar graph
c. Histogram
d. Scatter plot
e. Pie chart
PROCEDURE AND CODE:
ggplot2 is a plotting package that provides helpful commands to create plots from data in a data frame.
It provides a more programmatic interface for specifying what variables to plot and how they are displayed.
Installation and Loading:
install.packages("ggplot2")
library(ggplot2)
The 'iris' data comprises 150 observations of 5 variables.
i) Line graph (density curves of Sepal.Length, drawn as one line per species):
ggplot(iris, aes(x = Sepal.Length, color = Species)) + geom_density()
OUTPUT:
ii) Bar graph:
ggplot(mpg, aes(x = class)) + geom_bar()
ggplot() -> plotting function from the ggplot2 library
geom -> geometry (plot type)
fill -> denotes colour
OUTPUT:
iii) Histogram:
ggplot(data = iris, aes(x = Sepal.Length)) + geom_histogram()
OUTPUT:
iv) Scatter plot:
library(ggplot2)
ggplot(mtcars, aes(x = drat, y = mpg)) +
geom_point()
OUTPUT:
RESULT:
Thus, the data visualisation using the ggplot2 package has been executed successfully.
EX.NO: 5 DATA.TABLE PACKAGE
DATE:
AIM:
To do the data manipulation operations for the iris and airquality datasets
using the data.table package and obtain the results for the following
functions.
a. Select a subset row
b. Select a column with particular values
c. Select columns with multiple values
d. Select a column to return a vector
e. Select multiple columns
f. Returns the sum and standard deviation
g. Sum of selected columns
PROCEDURE AND CODE:
The data.table package is an enhanced version of data.frames, which
are the standard data structure for storing data in base R.
To install data.table, use the command below:
install.packages("data.table")
To load data.table, use the command below:
library(data.table)
Converting the data sets into data.tables:
data <- as.data.table(iris)
air <- as.data.table(airquality)
head(air)
head(data)
i) Select a subset of rows:
head(data[2:4, ])
OUTPUT:
ii) Select a column with particular values:
data[Species == 'setosa']
OUTPUT:
iii) Select columns with multiple values:
data[Species %in% c('setosa', 'virginica')]
OUTPUT:
iv) Select a column to return a vector:
air[, Temp]
OUTPUT:
v) Select multiple columns:
air[, .(Temp, Month)]
OUTPUT:
vi) Return the sum and standard deviation:
air[, .(sum(Ozone, na.rm = TRUE), sd(Ozone, na.rm = TRUE))]
OUTPUT:
vii) Sum of a selected column:
air[, sum(Ozone, na.rm = TRUE)]
OUTPUT:
RESULT:
Thus, the data.table package program has been executed
successfully.
EX.NO:6 CREATE THE VISUALIZATION GRAPHS
DATE:
AIM:
To create different types of graphs for user inputs using matplotlib: 1) Line graph 2) Line
graph with style 3) Bar graph (horizontal and vertical) 4) Histogram
5) Scatter plot.
PROCEDURE AND CODE:
1. LINE GRAPH:
import matplotlib.pyplot as plt
x=[5,6,8,10,15]
y=[20,30,40,50,55]
plt.plot(x,y)
plt.title("STUDENT DATA-LINE GRAPH")
plt.ylabel('Present %')
plt.xlabel('Roll.no')
plt.show()
OUTPUT PLOT:
2. LINE GRAPH WITH STYLE:
import matplotlib.pyplot as plt
import matplotlib.style
x=[5,6,8,10,15]
y=[20,30,40,50,55]
x2=[2,13,16,20,18]
y2=[25,35,16,23.5,40]
plt.plot(x,y,'c',label='A',linewidth=6)
plt.plot(x2,y2,'purple',label='B',linewidth=6)
plt.title('STUDENT DATA-LINE GRAPH WITH STYLE')
plt.ylabel('Present %')
plt.xlabel('Roll.no')
plt.legend()
plt.show()
OUTPUT PLOT:
3. BAR GRAPH:
A - VERTICAL:
import matplotlib.pyplot as plt
studentnames = ['Adeline','Jane','Roo','Bluewhale','Rossey']
marks = [85,55,90,45,60]
plt.bar(studentnames,marks,color='purple')
plt.title('STUDENT DATA-BAR GRAPH VERTICAL')
plt.xlabel('NAMES')
plt.ylabel('MARKS')
plt.show()
OUTPUT PLOT:
B - HORIZONTAL:
import matplotlib.pyplot as plt
studentnames = ['Adeline','Jane','Roo','Bluewhale','Rossey']
marks = [85,55,90,45,60]
plt.barh(studentnames,marks,color='c')
plt.title('STUDENT DATA-BAR GRAPH HORIZONTAL')
plt.xlabel('MARKS')
plt.ylabel('NAMES')
plt.show()
OUTPUT PLOT:
4. HISTOGRAM:
import matplotlib.pyplot as plt
student_marks=[45,12,13,26,15,55,100,98,95,54,58,56,52,24,71,66,66.5,12,23,55,78,10,9,5,10,22,35,65,45]
bins=[0,10,20,30,40,50,60,70,80,90,100]
plt.hist(student_marks,bins,histtype='bar',rwidth=0.8,color='purple')
plt.xlabel('MARKS')
plt.ylabel('NUMBER OF STUDENTS')
plt.title('STUDENT DATA-HISTOGRAM')
plt.show()
OUTPUT PLOT:
5. SCATTER PLOT:
import matplotlib.pyplot as plt
import matplotlib.style
x=[5,6,8,10,15]
y=[20,30,40,50,55]
x2=[2,13,16,20,18]
y2=[25,35,16,23.5,40]
plt.scatter(x,y,color='purple')
plt.scatter(x2,y2,color='c')
plt.title('STUDENT DATA-SCATTER PLOT')
plt.ylabel('Present %')
plt.xlabel('Roll.no')
plt.show()
OUTPUT PLOT:
RESULT:
The visualization graphs program has been executed successfully.
EX.NO: 7 EXPLORATORY DATA ANALYSIS (EDA)
DATE:
AIM:
To write a Python program to implement Exploratory Data Analysis (EDA) on a dataset and prepare it for
data visualization.
PROCEDURE AND CODE:
The EDA approach can be used to gather knowledge about the following aspects of data:
Main characteristics or features of the data.
The important variables that can be used in our problem.
EDA is performed under two broad classifications (a short sketch of both appears after this list):
Descriptive statistics, which include mean, median, mode, inter-quartile range, and so on.
Graphical methods, which include histograms, density estimation, box plots, and so on.
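A minimal, self-contained sketch of the descriptive-statistics side (the column name and values below are
hypothetical, chosen only for illustration and not taken from train.csv):
import pandas as pd
# Hypothetical numeric column used to illustrate the measures listed above
df = pd.DataFrame({"sample_values": [100, 120, 120, 150, 200, 128, 66]})
print(df["sample_values"].describe())            # count, mean, std, min, quartiles, max
print("median:", df["sample_values"].median())   # middle value
print("mode:", df["sample_values"].mode()[0])    # most frequent value
iqr = df["sample_values"].quantile(0.75) - df["sample_values"].quantile(0.25)
print("inter-quartile range:", iqr)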
Reading dataset:
import pandas as pd
import numpy as np
data=pd.read_csv(r"D:\Dataset\train.csv")
print(data)
OUTPUT:
1) Getting first few rows of the dataset:
print(data.head())
OUTPUT:
2) Getting shape of the data:
print(data.shape)
OUTPUT:
3) Checking missing values in the data:
print(data.isnull().sum())
OUTPUT:
4) Checking Data Types of the data:
print(data.dtypes)
OUTPUT:
5) Filling missing values of categorical variables with the mode:
data["Gender"].fillna(data["Gender"].mode()[0], inplace=True)
data["Married"].fillna(data["Married"].mode()[0], inplace=True)
data["Dependents"].fillna(data["Dependents"].mode()[0], inplace=True)
data["Self_Employed"].fillna(data["Self_Employed"].mode()[0], inplace=True)
data["Loan_Amount_Term"].fillna(data["Loan_Amount_Term"].mode()[0], inplace=True)
data["Credit_History"].fillna(data["Credit_History"].mode()[0], inplace=True)
6) Filling missing values of a continuous variable with the mean:
data["LoanAmount"].fillna(data["LoanAmount"].mean(), inplace=True)
7) Checking missing values:
print(data.isnull().sum())
OUTPUT:
8) Converting categorical variables into numerical:
# replace() with inplace=True modifies the column in place and returns None,
# so there is nothing useful to print for these calls.
data['Gender'].replace(['Male', 'Female'], [0, 1], inplace=True)
data['Married'].replace(['No', 'Yes'], [0, 1], inplace=True)
data['Dependents'].replace(['0', '1', '2', '3+'], [0, 1, 2, 3], inplace=True)
data['Education'].replace(['Not Graduate', 'Graduate'], [0, 1], inplace=True)
data['Self_Employed'].replace(['No', 'Yes'], [0, 1], inplace=True)
data['Property_Area'].replace(['Rural', 'Semiurban', 'Urban'], [0, 1, 2], inplace=True)
data['Loan_Status'].replace(['N', 'Y'], [0, 1], inplace=True)
9) Checking data values:
print(data.head())
OUTPUT:
10) Saving the pre-processed data:
data.to_csv("new_data.csv", index=False)
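The graphical-methods side of EDA mentioned at the start of this exercise is not shown in the steps above;
a minimal sketch, assuming the preprocessed file saved in step 10 (LoanAmount is one of its columns), could
look like this:
import pandas as pd
import matplotlib.pyplot as plt
# Reload the pre-processed data saved in the previous step
data = pd.read_csv("new_data.csv")
# Histogram of the imputed LoanAmount column
data["LoanAmount"].plot(kind="hist", bins=20, title="LoanAmount distribution")
plt.xlabel("LoanAmount")
plt.show()
# Box plot of the same column, useful for spotting outliers
data.boxplot(column="LoanAmount")
plt.show()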
RESULT:
The Exploratory Data Analysis program has been executed successfully.
EX.NO: 8 IBM WATSON STUDIO-PROJECT
DATE:
AIM:
To create a new data visualization project in IBM Watson
Studio using an individual account.
PROCEDURE AND CODE:
1) Open IBM Watson studio and Login using your account. The
project and the catalog must be created by members of the same IBM
Cloud account.
2) Click New project on the home page or on your Projects page.
3) Choose whether to create an empty project or to create a
project based on an exported project file or a sample project.
4) On the New project screen, add a name and optional
description for the project.
If the project file that you select to import is encrypted, you must
enter the password that was used for encryption to enable decrypting
sensitive connection properties.
5) Create a new project in the Data analysis section.
6) Click Add to project.
7) Select Data for data manipulation.
8) Under Assets, browse for the data.
9) Refine the data.
10) Open Visualizations.
11) Select the columns to visualize.
12) Visualize the data.
RESULT:
Thus, the IBM Watson Studio project has been executed
successfully.
EX.NO: 9 DATA ANALYSIS – COVID-19 DATASET
DATE:
AIM:
To do the data analysis and visualization for the COVID-19
dataset.
PROCEDURE AND CODE:
1) Import the libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2) Read the Covid Analysis dataset:
data = pd.read_csv(r"D:\Dataset\Covid Analysis.csv")
print(data.head())
OUTPUT:
3) Getting statistical information from the data:
print(data.describe())
OUTPUT:
4) Checking whether the serial number is the index:
print(data.index.name == 'S_No')
OUTPUT:
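The check above only prints whether the index is already named S_No. If the goal is to actually make the
serial number the index, a minimal sketch (assuming the CSV really contains an S_No column, and kept in a
separate variable so the later steps that rely on the default integer index still work) could be:
indexed = data.set_index("S_No")  # hypothetical: use the S_No column as the row index
print(indexed.head())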
5) Removing the row with null values:
new_data = data.drop(0)
print(new_data)
OUTPUT:
6) Describing the new cleaned dataset:
print(new_data.describe())
OUTPUT:
7) Fetching the information of Tamilnadu:
Tn = new_data.loc[29]
print(Tn)
OUTPUT:
8) Plotting a bar graph of State vs Death:
plt.figure(figsize=(10,10))
plt.bar(new_data['Name of State / UT'],new_data['Death'])
plt.xticks(rotation=90)
plt.show()
OUTPUT:
9) Plotting all variables for each state:
plt.plot(new_data['Name of State / UT'], new_data['Date'], color='Blue')
plt.scatter(new_data['Name of State / UT'], new_data['Total Confirmed cases*'], color='Blue')
plt.plot(new_data['Name of State / UT'], new_data['Cured/Discharged/Migrated'], color='Red')
plt.scatter(new_data['Name of State / UT'], new_data['Death'], color='Red')
plt.plot(new_data['Name of State / UT'], new_data['Latitude'], color='Green')
plt.scatter(new_data['Name of State / UT'], new_data['Longitude'], color='Green')
plt.xticks(rotation=90)
plt.show()
OUTPUT:
RESULT:
The data analysis program has been executed successfully.