Ex 1. Install the data analysis and visualization tool: R / Python / Power BI.
DATE:
1. Installation:
pip install pandas
2. Creating a DataFrame in Pandas:
# import the pandas library, renamed as pd
import pandas as pd
# assigning two series to s1 and s2
s1 = pd.Series([1,2])
s2 = pd.Series(["Ashish", "Sid"])
# framing series objects into data
df = pd.DataFrame([s1,s2])
# show the data frame df
df
# data framing in another way
# taking index and column values
dframe = pd.DataFrame([[1,2],["Ashish", "Sid"]],
index=["r1", "r2"],
columns=["c1", "c2"])
dframe
# framing in another way: dict-like container
dframe=pd.DataFrame({
"c1": [1, "Ashish"],
"c2": [2, "Sid"]})
dframe
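The row-wise and dict-like constructions above describe the same table. A minimal sketch to check this, assuming the index labels r1 and r2 are also passed to the dict form (the dict example above keeps the default 0, 1 index); the names by_rows, by_cols and same_table are only illustrative:

```python
import pandas as pd

# Row-wise construction: each inner list is one row.
by_rows = pd.DataFrame([[1, 2], ["Ashish", "Sid"]],
                       index=["r1", "r2"],
                       columns=["c1", "c2"])

# Dict-like construction: each key is a column name, each list a column.
# Passing index= here is an addition made only for the comparison.
by_cols = pd.DataFrame({"c1": [1, "Ashish"], "c2": [2, "Sid"]},
                       index=["r1", "r2"])

# Compare labels and values of the two frames.
same_table = by_rows.to_dict() == by_cols.to_dict()
print(same_table)
```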
3. Importing Data with Pandas
# Import the pandas library, renamed as pd
import pandas as pd
# Read IND_data.csv into a DataFrame, assigned to df
df = pd.read_csv("IND_data.csv")
# Prints the first 5 rows of a DataFrame as default
df.head()
# Prints no. of rows and columns of a DataFrame
df.shape
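IND_data.csv is not included with this record, so the same calls can be tried on a small in-memory CSV. A minimal sketch, assuming invented rows; only the column names Time period and Observation Value are taken from the later sections:

```python
import io
import pandas as pd

# Hypothetical stand-in for IND_data.csv; the values are invented.
csv_text = """Time period,Observation Value
2001,10.5
2002,11.0
2003,12.3
2004,11.8
2005,13.1
2006,14.0
2007,13.6
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first 5 rows by default
print(sample.shape)    # (number of rows, number of columns)
```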
4. Indexing DataFrames with Pandas
# prints first 5 rows and every column, which replicates df.head()
df.iloc[0:5,:]
# prints entire rows and columns
df.iloc[:,:]
# prints rows from position 5 onwards and the first 5 columns
df.iloc[5:,:5]
5. Indexing Using Labels in Pandas
# prints rows with index labels 0 through 5 (loc includes the end label) and every column of df
df.loc[0:5,:]
# prints from index label 5 onwards and every column
df.loc[5:,:]
# prints the Time period value for index labels 0 through 5
df.loc[:5,"Time period"]
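The practical difference between iloc and loc is how the end of a slice is treated. A minimal sketch, assuming an invented frame with the default integer index; only the column names come from this exercise:

```python
import pandas as pd

# Hypothetical data with the default integer index 0..7.
demo = pd.DataFrame({"Time period": range(2001, 2009),
                     "Observation Value": [10.5, 11.0, 12.3, 11.8,
                                           13.1, 14.0, 13.6, 15.2]})

by_position = demo.iloc[0:5, :]   # positions 0..4, end excluded
by_label = demo.loc[0:5, :]       # labels 0..5, end included

print(len(by_position), len(by_label))
```

With a default integer index the two slices differ by exactly one row.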
6. Installation
pip install matplotlib
7. Pandas Plotting
# import the required module
import matplotlib.pyplot as plt
# plot a histogram
df['Observation Value'].hist(bins=10)
# shows presence of a lot of outliers/extreme values
df.boxplot(column='Observation Value', by = 'Time period')
# plotting points as a scatter plot
x = df["Observation Value"]
y = df["Time period"]
plt.scatter(x, y, label= "stars", color= "m",marker= "*", s=30)
# x-axis label
plt.xlabel('Observation Value')
# y-axis label
plt.ylabel('Time period')
# function to show the plot
plt.show()
Output:
Ex 2. Perform exploratory data analysis (EDA) on datasets like an email
data set. Export all your emails as a dataset, import them inside a
pandas data frame, visualize them and get different insights from the data.
DATE:
PROGRAM:
#import required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#read the dataset to be accessed
df=pd.read_csv(r"C:\Users\sibis\Desktop\studies\2nd sem\dev lab\EMAIL.csv")
#head() function: first 5 rows
print(df.head())
#dimensions of table
print(df.shape)
#table info (info() prints its summary directly)
df.info()
#table statistics
print(df.describe())
#unique values of a particular attribute
print(df.TO.unique())
#subject of mail
plt.plot(df["FROM"],df["SUBJECT"])
plt.show()
OUTPUT:
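Plotting FROM against SUBJECT as a line gives little insight because both columns are text. One possible alternative is to count messages per sender. A minimal sketch, assuming a tiny invented frame that reuses the FROM, TO and SUBJECT column names from the program above; the addresses and the SUBJECT_LEN column are hypothetical:

```python
import pandas as pd

# Invented sample rows; a real export would be read from EMAIL.csv.
emails = pd.DataFrame({
    "FROM": ["a@x.com", "b@x.com", "a@x.com", "c@x.com", "a@x.com"],
    "TO": ["me@x.com"] * 5,
    "SUBJECT": ["Hi", "Invoice", "Re: Hi", "Offer", "Meeting"],
})

# Number of emails received from each sender, most frequent first.
per_sender = emails["FROM"].value_counts()
print(per_sender)

# Length of each subject line in characters.
emails["SUBJECT_LEN"] = emails["SUBJECT"].str.len()
print(emails["SUBJECT_LEN"].describe())
```

Calling per_sender.plot(kind="bar") followed by plt.show() would draw the counts as a bar chart.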
Ex 3. Working with NumPy arrays, Pandas data frames, Basic plots using Matplotlib.
DATE:
3(a) NumPy arrays using matplotlib
Program:
import numpy as np
from matplotlib import pyplot as plt
x = np.arange(1,11)
y=2*x+5
plt.title("Matplotlib demo")
plt.xlabel("x axis caption")
plt.ylabel("y axis caption")
plt.plot(x,y)
plt.show()
Output:
3(b) Pandas dataframe using matplotlib
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Name': ['John', 'Sammy', 'Joe'],'Age': [45, 38, 90]})
df.plot(x="Name", y="Age", kind="bar")
# show the bar chart when run as a script
plt.show()
3(c) Basic plots
import matplotlib.pyplot as plt
x = [1,1.5,2,2.5,3,3.5,3.6]
y = [7.5,8,8.5,9,9.5,10,10.5]
x1=[8,8.5,9,9.5,10,10.5,11]
y1=[3,3.5,3.7,4,4.5,5,5.2]
plt.scatter(x,y, label='high income low saving',color='r')
plt.scatter(x1,y1,label='low income high savings',color='b')
plt.xlabel('saving*100')
plt.ylabel('income*1000')
plt.title('Scatter Plot')
plt.legend()
plt.show()
Ex 4. Explore various variable and row filters in python for cleaning data. Apply
various plot features in python on sample data sets and visualize
DATE:
Program:
# import the pandas library
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df)
Output:
one two three
a 0.077988 0.476149 0.965836
b NaN NaN NaN
c -0.390208 -0.551605 -2.301950
d NaN NaN NaN
e -2.000303 -0.788201 1.510072
Program: (Checking for missing values)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df['one'].isnull())
Output:
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
Program:(filling missing data)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print (df)
print ("NaN replaced with '0':")
print (df.fillna(0))
Output:
one two three
a -0.576991 -0.741695 0.553172
b NaN NaN NaN
c 0.744328 -1.735166 1.749580
NaN replaced with '0':
one two three
a -0.576991 -0.741695 0.553172
b 0.000000 0.000000 0.000000
c 0.744328 -1.735166 1.749580
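Filling with 0 is only one option. Another common choice is to fill each gap with the mean of its column. A minimal sketch, assuming fixed values instead of random ones so the result can be checked by hand; the names scores and filled are illustrative:

```python
import numpy as np
import pandas as pd

# Fixed values (not random) so the result is easy to verify.
scores = pd.DataFrame({"one": [1.0, np.nan, 3.0],
                       "two": [4.0, np.nan, 8.0]},
                      index=["a", "b", "c"])

# DataFrame.mean() skips NaN, so each gap gets the mean of its own column.
filled = scores.fillna(scores.mean())
print(filled)
```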
Program:(Drop missing values)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df.dropna())
Output:
one two three
a 0.077988 0.476149 0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201 1.510072
f -0.930230 -0.670473 1.14661
Program:(Replace missing or generic values)
import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print (df.replace({1000:10,2000:60}))
Output:
one two
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
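The exercise title also mentions variable and row filters. A minimal sketch on the same two-column frame, assuming a cut-off of 100 as the rule for a valid reading (the threshold is an assumption chosen for illustration):

```python
import pandas as pd

data = pd.DataFrame({"one": [10, 20, 30, 40, 50, 2000],
                     "two": [1000, 0, 30, 40, 50, 60]})

# Row filter: keep rows where both columns are at most 100.
in_range = data[(data["one"] <= 100) & (data["two"] <= 100)]

# Variable (column) filter: keep only the column named "one".
only_one = data.loc[:, ["one"]]

# query() expresses the same row filter as a string.
same_rows = data.query("one <= 100 and two <= 100")

print(in_range)
```

Filtering drops the suspicious rows, whereas replace() keeps them and overwrites the values; which one fits depends on the data.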
Ex 5. Perform Time Series Analysis and apply the various visualization techniques.
DATE:
Program:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
plt.rcParams.update({'figure.figsize': (10, 7), 'figure.dpi': 120})
# Import as Dataframe
df=pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date'])
df.head()
Date Value
0 1991-07-01 3.526591
1 1991-08-01 3.180891
2 1991-09-01 3.252221
3 1991-10-01 3.611003
4 1991-11-01 3.565869
# Time series data source: fpp package in R.
import matplotlib.pyplot as plt
df=pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date'], index_col='date')
# Draw Plot
def plot_df(df, x, y, title="", xlabel='Date', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()
plot_df(df, x=df.index, y=df.value, title='Monthly anti-diabetic drug sales in Australia from 1992 to 2008.')
Output:
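Beyond the line plot, two simple time series summaries are a rolling mean (to expose the trend) and per-year averages. A minimal sketch, assuming a synthetic monthly series rather than the a10 data, so it runs without a network connection; the trend and seasonal terms are invented:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: an upward trend plus a 12-month seasonal cycle.
dates = pd.date_range("1991-07-01", periods=48, freq="MS")
months = np.arange(48)
values = 3.0 + 0.1 * months + np.sin(2 * np.pi * months / 12)
series = pd.Series(values, index=dates, name="value")

# 12-month rolling mean: the first 11 entries are NaN because the
# window is not yet full.
trend = series.rolling(window=12).mean()

# Mean value for each calendar year.
yearly_mean = series.groupby(series.index.year).mean()

print(trend.dropna().head())
print(yearly_mean)
```

Because the window spans one full seasonal cycle, the rolling mean averages the seasonality out and leaves the linear trend.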
Ex 6. Perform Data Analysis and representation on a Map using various Map data sets with
Mouse Rollover effect, user interaction, etc.
DATE:
Program:
import plotly.express as px
import pandas as pd
print("getting data")
df=px.data.carshare()
print(df.head(10))
print(df.tail(10))
fig=px.scatter_mapbox(df,
lon=df["centroid_lon"],
lat=df["centroid_lat"],
zoom=10,
color=df["peak_hour"],
size=df["car_hours"],
width=1200,
height=900,
title="CAR SHARE SCATTER MAP")
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":50,"b":10})
fig.show()
Output
Ex 7. Build cartographic visualization for multiple datasets involving
various countries of the world, states and districts in India etc.
DATE:
Program:
Output:
Ex 8. Perform EDA on Wine Quality Data Set
DATE:
Program:
Output:
Ex 9. Use a case study on a data set and apply the various EDA and
visualization techniques and present an analysis report
DATE:
Program:
import pandas as pd
import numpy as np
import seaborn as sns
#Load the data
df =pd.read_csv('titanic.csv')
#View the data
df.head()
#Basic information
df.info()
#Describe the data
df.describe()
Describe the data - Descriptive statistics.
Duplicate values
df.duplicated().sum()
Output:
0
This means there is not a single duplicate row present in our dataset.
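Since the Titanic data contains no duplicates, the effect of duplicated() is easier to see on a frame that does. A minimal sketch, assuming a few invented rows where the last one repeats the first:

```python
import pandas as pd

# Invented rows; the last row repeats the first one exactly.
people = pd.DataFrame({"Name": ["Ann", "Bob", "Cid", "Ann"],
                       "Pclass": [1, 3, 2, 1]})

n_duplicates = people.duplicated().sum()   # counts repeated rows
deduplicated = people.drop_duplicates()    # keeps the first occurrence

print(n_duplicates)
print(deduplicated)
```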
Unique values in the data
#unique values
df['Pclass'].unique()
df['Survived'].unique()
df['Sex'].unique()
array([3, 1, 2], dtype=int64)
array([0, 1], dtype=int64)
array(['male', 'female'],dtype=object)
Visualize the Unique counts
#Plot the unique values
sns.countplot(x=df['Pclass'])
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Replace the Null values
The replace() function replaces all the null values with a specific value.
#Replace null values
df.replace(np.nan,'0',inplace = True)
#Check the changes now
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64
Know the datatypes
df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age object
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
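The listing shows Age as object rather than a numeric type. That follows from replacing NaN with the string '0', which mixes text into a numeric column. A minimal sketch of an alternative, assuming a tiny invented Age column and using the median as the fill value (a common choice, not the method used in the program above):

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0]})

# Replacing NaN with the string '0' mixes text into the column.
as_text = ages.replace(np.nan, "0")
print(as_text["Age"].dtype)

# Filling with the column median keeps the column numeric.
as_number = ages.fillna({"Age": ages["Age"].median()})
print(as_number["Age"].dtype)
```

Keeping Age numeric matters for later steps such as describe() and the correlation plot, which only make sense on numeric columns.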
Filter the Data
df[df['Pclass']==1].head()
A quick box plot
df[['Fare']].boxplot()
Correlation Plot - EDA
# numeric_only=True skips text columns such as Name and Sex
df.corr(numeric_only=True)
#Correlation plot
sns.heatmap(df.corr(numeric_only=True))