Fdsa Lab Manual Final
TRICHY- 12
B.E.(EEE)
REGULATION R-2021
IV SEMESTER
LABORATORY MANUAL
AD3411 DATA SCIENCE AND ANALYTICS LABORATORY
L T P C
0 0 4 2
COURSE OBJECTIVES:
To develop data analytic code in Python
To be able to use Python libraries for handling data
To develop analytical applications using Python
To perform data visualization using plots
LIST OF EXPERIMENTS
Tools: Python, NumPy, SciPy, Matplotlib, Pandas, statsmodels, seaborn, plotly, bokeh
Working with Numpy arrays.
COURSE OUTCOMES:
CO   DESCRIPTION                                                    EXP. NO.
1.   Write Python programs to handle data using NumPy and Pandas    1, 2
CONTENTS
Ex.No:01
BASIC PLOTS USING MATPLOTLIB
AIM
To write a program to draw various plots using Matplotlib.
PROCEDURE
PROGRAM#1
# Line graph example
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y1 = [1, 3, 5, 3, 1, 3, 5, 3, 1]
y2 = [2, 4, 6, 4, 2, 4, 6, 4, 2]
plt.plot(x, y1, label="line L")
plt.plot(x, y2, label="line H")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.title("Line Graph Example")
plt.legend()
plt.show()
PROGRAM#2
# Bar chart example
import matplotlib.pyplot as plt
# The x positions 4 and 6 appear in both series, so those bars overlap.
x1 = [1, 3, 4, 5, 6, 7, 9]
y1 = [4, 7, 2, 4, 7, 8, 3]
x2 = [2, 4, 6, 8, 10]
y2 = [5, 6, 2, 6, 2]
plt.bar(x1, y1, label="Blue Bar", color='b')
plt.bar(x2, y2, label="Green Bar", color='g')
plt.xlabel("bar number")
plt.ylabel("bar height")
plt.title("Bar Chart Example")
plt.legend()
plt.show()
PROGRAM#3
# Scatter plot example
import matplotlib.pyplot as plt
x1 = [2, 3, 4]
y1 = [5, 5, 5]
x2 = [1, 2, 3, 4, 5]
y2 = [2, 3, 2, 3, 4]
y3 = [6, 8, 7, 8, 7]
plt.scatter(x1, y1)
plt.scatter(x2, y2, marker='v', color='r')
plt.scatter(x2, y3, marker='^', color='m')
plt.title('Scatter Plot Example')
plt.show()
PROGRAM#4
# Histogram example
import matplotlib.pyplot as plt
import numpy as np
# Creating the dataset
a = np.array([61, 63, 64, 66, 68, 69, 69.5, 70, 72,
              72.5, 73, 73.5, 74, 74.5, 76, 76.2,
              76.5, 77, 77.5, 78, 78.5, 79, 79.2,
              80, 81, 82, 83, 84, 85, 87])
# Creating the histogram
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(a, bins=[60, 65, 70, 75, 80, 85, 90])
# Show plot
plt.show()
RESULT
Thus the implementation of various plots has been executed successfully.
Ex.No:02
WORKING WITH NUMPY ARRAYS
AIM
To write a program in Python to perform multi-dimensional array manipulation using NumPy.
PROCEDURE
What is NumPy?
NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier transforms, and
matrices.
NumPy was created in 2005 by Travis Oliphant. It is an open-source project and you can
use it freely.
NumPy stands for Numerical Python.
PROGRAM
(2.1). Matrix Addition
Program:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[10, 11, 12], [13, 14, 15]])
c = a + b
print("a =", a)
print("b =", b)
print("Addition of a and b = ", c)
Output:
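(2.2). Matrix Multiplication (element-wise)
Program:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[10, 11, 12], [13, 14, 15]])
# The * operator multiplies corresponding elements. It is not the
# linear-algebra matrix product (two 2x3 arrays cannot be matrix-multiplied).
c = a * b
print("a =", a)
print("b =", b)
print("Element-wise multiplication of a and b = ", c)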
Output:
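Note that `*` on NumPy arrays multiplies element by element, whereas the linear-algebra matrix product uses the `@` operator (or `np.matmul`). A minimal sketch of the difference follows; the array `m` is a hypothetical 3x2 array introduced only so that the shapes are compatible.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])      # shape (2, 3), as in program 2.2
m = np.array([[1, 0], [0, 1], [2, 2]])    # hypothetical (3, 2) array for illustration

elementwise = a * a       # element-by-element product, shape (2, 3)
matrix_product = a @ m    # matrix product, shape (2, 2); same as np.matmul(a, m)

print("Element-wise a * a =\n", elementwise)
print("Matrix product a @ m =\n", matrix_product)
```

The matrix product requires the number of columns of the left operand to equal the number of rows of the right operand, which is why `a @ a` would raise an error here.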
(2.3). Scalar Multiplication of a Matrix
Program:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = a * 3
print("a =", a)
print("b = a*3 =", b)
Output:
(2.4). Matrix Transpose
Program:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = a.T
print("a = \n", a)
print("Transpose of a = \n", b)
Output:
(2.5). Array Datatype Conversion
Program:
import numpy as np
a = np.array([[2.5, 3.8, 1.5], [4.7, 2.9, 1.56]])
b = a.astype('int')  # truncates the fractional part
print("The array in float datatype =\n", a)
print("The array in int datatype =\n", b)
Output:
(2.6). Stacking of NumPy arrays
Program:
import numpy as np
a1 = np.array([[1, 2, 3], [4, 5, 6]])
a2 = np.array([[7, 8, 9], [10, 11, 12]])
c = np.hstack((a1, a2))
d = np.vstack((a1, a2))
print("The two arrays are:\na1 =\n", a1, "\na2 =\n", a2)
print("\nHorizontal stacking:\n", c)
print("\nVertical stacking:\n", d)
Output:
(2.7).Sequence generation
Program:
import numpy as np
lists = [x for x in range(0,101, 2)]
a=np.array(lists)
print(a)
Output:
(2.8). Sorting an array
Program:
import numpy as np
a = np.array([[1, 4, 2], [3, 4, 6], [0, -1, 5]])
print("Array before sorting:")
print(a)
print("Flattened and sorted:")
print(np.sort(a, axis=None))
print("Sorting row-wise:")
print(np.sort(a, axis=1))
print("Sorting column-wise:")
print(np.sort(a, axis=0))
Output:
RESULT
Thus the array manipulation programs using NumPy have been executed successfully.
Ex.No:03
WORKING WITH PANDAS DATAFRAMES
AIM
To perform various operations on dataframe using pandas module in python.
PROCEDURE
What is Pandas?
Pandas is a Python library used for working with datasets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was
created by Wes McKinney in 2008.
PROGRAM
(3.1). Creating a dataframe
import pandas as pd
data = [['name1', 21, '[email protected]', 1234567891],
        ['name2', 26, '[email protected]', 1234567892]]
df = pd.DataFrame(data, columns=['NAME', 'AGE', 'EMAIL ID', 'PHONENUMBER'], index=[1, 2])
print(df)
Output:
(3.2). Creating a dataframe using a dictionary
import pandas as pd
data = {'Name': ['aa', 'bb', 'cc'], 'Age': [20, 21, 25]}
df = pd.DataFrame(data)
print(df)
Output:
(3.3). Creating a dataframe from a series
import pandas as pd
data = {'ONE': pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4]),
        'TWO': pd.Series([50, 60, 70, 80], index=[1, 2, 3, 4])}
df = pd.DataFrame(data)
print(df)
Output:
(3.4). Sorting the dataframe
import pandas as pd
data = {'Name': ['name1', 'name2', 'name3'], 'Age': [20, 21, 22]}
df = pd.DataFrame(data)
print("\nDataset before sorting:\n", df)
d_sort1 = df.sort_values(by='Name')
print("\nDataset after sorting by Name:\n", d_sort1)
d_sort2 = df.sort_values(by='Age')
print("\nDataset after sorting by Age:\n", d_sort2)
Output:
(3.5). Manipulation of a dataframe
(i) Selection of a column:
Source Code:
import pandas as pd
data = {'ONE': pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4]),
        'TWO': pd.Series([50, 60, 70, 80], index=[1, 2, 3, 4])}
df = pd.DataFrame(data)
print(" ---------------------- ")
print(df)
print(" ---------------------- ")
print("Selecting column ONE")
print(df['ONE'])
print(" ---------------------- ")
print("Selecting column TWO")
print(df['TWO'])
print(" ---------------------- ")
Output:
(ii) Addition of a column:
import pandas as pd
data = {'ONE': pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4]),
        'TWO': pd.Series([50, 60, 70, 80], index=[1, 2, 3, 4])}
df = pd.DataFrame(data)
print(" ---------------------- ")
print("Data Frame before adding a new column")
print(df)
print(" ---------------------- ")
df['THREE'] = pd.Series([90, 100, 110, 120], index=[1, 2, 3, 4])
print("Data Frame after adding a new column\n", df)
print(" ---------------------- ")
Output:
(iv) Selection of rows
Source Code:
import pandas as pd
data = {'ONE': pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd']),
        'TWO': pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])}
print(" --------------------- ")
df = pd.DataFrame(data)
print("DataFrame:\n", df)
print(" --------------------- ")
print("row 'c':")
print(df.loc['c'])
print(" --------------------- ")
Output:
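(v) Deletion of rows
Source Code:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print("DataFrame:")
print(df)
# Drop the row with index label 0.
df = df.drop(0)
# Alternatively, keep only the rows that satisfy a condition, e.g. for a
# DataFrame that has an 'age' column: df1 = df[df['age'] > 20]
print("DataFrame after deleting row 0:")
print(df)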
Output:
RESULT
The various operations on a data frame using the pandas module in Python have been implemented
and executed successfully.
Ex.No:04
READING DATA FROM VARIOUS SOURCES
AIM
To read data from a text file, an Excel file, and a web page using Python functions.
PROCEDURE
To access data from a CSV file, we use the function read_csv(), which returns the data
as a DataFrame.
pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None,
usecols=None, engine=None, skiprows=None, nrows=None)
To access data from an Excel file, we use the function read_excel(), which returns the
data as a DataFrame.
To access data from a web page, we use the function read_html(), which returns a list of
DataFrames, one for each HTML table found on the page.
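The read_csv() parameters listed above can be tried without any file on disk. The sketch below is only an illustration: the CSV text and its column names are made up, and `io.StringIO` stands in for a file such as Data.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as Data.csv
csv_text = "name,age,city\nAsha,21,Trichy\nRavi,23,Chennai\nMeena,22,Madurai\n"

# read_csv accepts a path or any file-like object.
# usecols keeps only the named columns; nrows limits the number of data rows read.
df = pd.read_csv(io.StringIO(csv_text), sep=",", header="infer",
                 usecols=["name", "age"], nrows=2)
print(df)
print(df.dtypes)
```

With `header="infer"` the first line of the text is taken as the column names, so `usecols` can refer to the columns by name.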
PROGRAM
(4.1) Reading data from a text file
# The with statement closes the file automatically after reading.
with open(r'Data.txt') as T:
    print(T.read())
Output:
(4.2) Reading the CSV file
import pandas as pd
data = pd.read_csv(r'Data.csv')
print(data)
Output:
(4.3) Reading the Excel file
Program
import pandas as pd
data = pd.read_excel(r'Data.xlsx')
print(data)
Output:
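(4.4) Reading from the web
Install the required parsers first:
pip install lxml
pip install html5lib
pip install bs4
Program:
import pandas as pd
url = "https://en.wikipedia.org/wiki/Iris_flower_data_set"
# read_html() returns a list with one DataFrame per table on the page.
df = pd.read_html(url)
print(df)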
Output:
[     Dataset order  Sepal length  ...  Petal width  Species
[150 rows x 6 columns]
RESULT
The operations for reading data from various sources in Python have been
implemented and executed successfully.
Ex.No:05a
IMPLEMENTATION OF FREQUENCY DISTRIBUTIONS
AIM
To write a Python program to implement the concept of frequency distribution.
PROCEDURE
The frequency (f) of a particular value is the number of times the value occurs in the data.
The distribution of a variable is the pattern of frequencies, meaning the set of all
possible values and the frequencies associated with these values.
Frequency distributions are portrayed as frequency tables or charts.
Frequency distributions can show either the actual number of observations falling in each
range or the percentage of observations. In the latter instance, the distribution is called a
relative frequency distribution.
Frequency distribution is implemented by the function crosstab().
Syntax: pandas.crosstab(index, columns, values=None, rownames=None, colnames=None,
aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Arguments:
index: array-like, Series, or list of arrays/Series. Values to group by in the rows.
columns: array-like, Series, or list of arrays/Series. Values to group by in the columns.
values: array-like, optional. Array of values to aggregate according to the factors.
Requires `aggfunc` to be specified.
rownames: sequence, default None. If passed, must match the number of row arrays
passed.
colnames: sequence, default None. If passed, must match the number of column arrays
passed.
aggfunc: function, optional. If specified, requires `values` to be specified as well.
margins: bool, default False. Add row/column margins (subtotals).
margins_name: str, default 'All'. Name of the row/column that will contain the totals
when margins is True.
dropna: bool, default True. Do not include columns whose entries are all NaN.
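The effect of the `normalize` and `margins` arguments can be sketched on a small table. The survey data below are hypothetical and serve only to illustrate the calls.

```python
import pandas as pd

# Hypothetical survey data, used only for illustration
df = pd.DataFrame({
    "Grade":  ["A", "A", "B", "B", "B", "C"],
    "Gender": ["M", "F", "M", "M", "F", "F"],
})

# Absolute frequencies of each grade
counts = pd.crosstab(index=df["Grade"], columns="count")
# Relative frequencies: each count divided by the total number of rows
relative = pd.crosstab(index=df["Grade"], columns="count", normalize=True)
# Two-way frequency table with row and column totals in an "All" margin
two_way = pd.crosstab(index=df["Grade"], columns=df["Gender"], margins=True)

print(counts, relative, two_way, sep="\n\n")
```

Passing `normalize=True` gives the relative frequency distribution directly, which is equivalent to dividing the count table by its total.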
PROGRAM
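import pandas as pd
data = pd.Series([1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5])
print(data.value_counts())
print(data.value_counts(sort=False))
# NOTE: in the printed listing the column lists have unequal lengths and one Age entry
# is empty, so a DataFrame cannot be built from them as printed. As an assumption, the
# empty Age entry is dropped and two 'A' grades are removed so every column has 10 entries.
df = pd.DataFrame({'Grade': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'D', 'D'],
                   'Age': [18, 18, 18, 19, 19, 20, 18, 18, 19, 19],
                   'Gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'F']})
print(df)
print(pd.crosstab(index=df['Grade'], columns='count'))
print(pd.crosstab(index=df['Age'], columns='count'))
tab = pd.crosstab(index=df['Age'], columns='count')
print(tab / tab.sum())
print(pd.crosstab(index=df['Age'], columns=df['Grade']))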
Output:
RESULT
The implementation of the concept of frequency distribution in python has been executed successfully.
Ex.No:05b
IMPLEMENTATION OF AVERAGES AND VARIABILITY
AIM
To write a Python program to implement the concepts of average and variability.
PROCEDURE
The mean() function returns the mean of the values along the requested axis.
If we apply this method to a Series object, it returns a scalar value, which is the
mean of all the observations in the Series.
If we apply this method to a DataFrame object, it returns a Series object that
contains the mean of the values along the specified axis.
DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Parameters
axis: {index (0), columns (1)}.
The axis along which the function is applied.
skipna: Exclude all null values when computing the result.
level: If the axis is a MultiIndex (hierarchical), compute along a particular level,
collapsing into a Series. (This argument was removed in pandas 2.0.)
numeric_only: Include only int, float, and Boolean columns. If None, attempt to
use everything, then use only numeric data. Not implemented for Series.
It returns a scalar for a Series, or a Series for a DataFrame.
The Pandas std() function calculates the standard deviation of a given set of
numbers, a Series, or the columns or rows of a DataFrame. By default it uses ddof=1, which
gives the sample standard deviation.
No extra package is needed for std(). For plain Python lists, the standard-library module
"statistics" offers comparable functions such as mean(), median(), and stdev().
Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
It returns a scalar for a Series, or a Series for a DataFrame.
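One detail worth checking in practice is the `ddof` argument: pandas `std()` defaults to `ddof=1` (divide by n - 1), while NumPy `np.std()` defaults to `ddof=0` (divide by n), so the two give different results on the same data unless `ddof` is set explicitly. A minimal sketch, using a small list of values chosen for illustration:

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]
s = pd.Series(values)

population_sd = np.std(values)             # ddof=0: divide by n
sample_sd_pandas = s.std()                 # ddof=1: divide by n - 1
sample_sd_numpy = np.std(values, ddof=1)   # NumPy with the pandas convention

print("NumPy default (ddof=0): ", population_sd)
print("pandas default (ddof=1):", sample_sd_pandas)
print("NumPy with ddof=1:      ", sample_sd_numpy)
```

Use `ddof=0` when the data are the whole population and `ddof=1` when they are a sample from a larger population.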
PROGRAM
AVERAGE AND VARIABILITY using NumPy and a list:
import numpy as np
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(np.average(values))  # Calculating the average using average()
print(np.var(values))      # Calculating the variance using var()
print(np.std(values))      # Calculating the standard deviation using std()
OUTPUT
5.0
4.0
2.0
import numpy as np
x = np.arange(5)  # Original array
print(x)
r11 = np.mean(x)
r12 = np.average(x)
print("\nMean:", r11, r12)
r21 = np.std(x)
r22 = np.sqrt(np.mean((x - np.mean(x))**2))
print("\nstd dev:", r21, r22)
r31 = np.var(x)
r32 = np.mean((x - np.mean(x))**2)
print("\nvariance:", r31, r32)
OUTPUT
[0 1 2 3 4]

Mean: 2.0 2.0

std dev: 1.4142135623730951 1.4142135623730951

variance: 2.0 2.0
AVERAGE AND VARIABILITY using NumPy and a dictionary
import numpy as np
dicti = {'a': 20, 'b': 32, 'c': 12, 'd': 93, 'e': 84}  # creating our test dictionary
listr = []
for value in dicti.values():
    listr.append(value)
mean = np.mean(listr)
std = np.std(listr)
print(mean)
print(std)
OUTPUT
48.2
33.63569532505609
AVERAGE AND VARIABILITY using Pandas
import pandas as pd
s = pd.Series(data=[5, 9, 8, 5, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 3])
print(s)
print(s.mean())  # finding the mean
print(s.std())   # finding the standard deviation
OUTPUT
0 5
1 9
2 8
3 5
4 7
5 8
6 1
7 2
8 3
9 4
10 5
11 6
12 7
13 8
14 9
15 5
16 3
dtype: int64
5.588235294117647
2.450990196058824
import pandas as pd
df = pd.DataFrame({'ID': [114, 345, 157788, 5626], 'Discount': [10, 20, 10, 50]})
print(df)
print(df.mean())  # finding the mean
print(df.std())   # finding the standard deviation
OUTPUT
       ID  Discount
0     114        10
1     345        20
2  157788        10
3    5626        50
ID          40968.25
Discount       22.50
dtype: float64
ID          77921.427966
Discount       18.929694
dtype: float64
RESULT
The implementation of the concept of average and variability in python has been executed successfully.
Ex.No:06a
IMPLEMENTATION OF NORMAL CURVES
AIM
To write a Python program to implement the concept of the normal distribution.
PROCEDURE
In a normal distribution, data are symmetrically distributed with no skew. Most values cluster
around a central region, with values tapering off as they go further away from the center.
The measures of central tendency (mean, mode, and median) are exactly the same in a normal
distribution.
In Python, we implement normal distribution using norm().
scipy.stats.norm() is a normal continuous random variable.
It is inherited from the generic methods as an instance of the rv_continuous class.
It completes the methods with details specific for this particular distribution.
The parameters used with norm() and its methods are:
q: lower and upper tail probability
x: quantiles
loc: [optional] location parameter (the mean). Default = 0
scale: [optional] scale parameter (the standard deviation). Default = 1
size: [tuple of ints, optional] shape of the random variates.
moments: [optional] composed of letters ['mvsk']; 'm' = mean, 'v' = variance, 's' = Fisher’s skew
and 'k' = Fisher’s kurtosis (default = 'mv').
norm() function returns normal continuous random variable.
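The methods that take these parameters can be sketched on a single "frozen" distribution. The values `loc=5.3` and `scale=1` match the height example used in the programs of this exercise; the sample size and the seed are arbitrary choices for illustration.

```python
from scipy.stats import norm

# Freeze a normal distribution with mean 5.3 and standard deviation 1
height = norm(loc=5.3, scale=1)

density_at_mean = height.pdf(5.3)           # x: height of the curve at a quantile
prob_below = height.cdf(4.5)                # P(X < 4.5)
quantile_95 = height.ppf(0.95)              # q: value below which 95% of the mass lies
mean, var = height.stats(moments="mv")      # moments: mean and variance
sample = height.rvs(size=5, random_state=0) # size: five random variates

print(density_at_mean, prob_below, quantile_95)
print(mean, var)
print(sample)
```

Here `ppf` is the inverse of `cdf`, so `height.cdf(height.ppf(0.95))` returns 0.95.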
PROGRAM#1
import numpy as np
import matplotlib.pyplot as plt
PROGRAM#2
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
data = np.arange(1, 10, 0.01)  # Creating the distribution
pdf = norm.pdf(data, loc=5.3, scale=1)
sb.set_style('whitegrid')  # Visualizing the distribution
# Recent seaborn versions require x and y to be passed as keywords.
sb.lineplot(x=data, y=pdf, color='black')
plt.xlabel('Heights')
plt.ylabel('Probability Density')
plt.show()
OUTPUT
PROGRAM#3
# import required libraries
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
# Creating the distribution
data = np.arange(1, 10, 0.01)
pdf = norm.pdf(data, loc=5.3, scale=1)
# Probability of height to be under 4.5 ft.
prob_1 = norm(loc=5.3, scale=1).cdf(4.5)
print(prob_1)
# Probability that the height of the person will be between 4.5 and 6.5 ft.
cdf_upper_limit = norm(loc=5.3, scale=1).cdf(6.5)
cdf_lower_limit = norm(loc=5.3, scale=1).cdf(4.5)
prob_2 = cdf_upper_limit - cdf_lower_limit
print(prob_2)
# Probability that the height of a person chosen randomly will be above 6.5 ft.
cdf_value = norm(loc=5.3, scale=1).cdf(6.5)
prob_3 = 1 - cdf_value
print(prob_3)
sb.set_style('whitegrid')
# Recent seaborn versions require x and y as keywords; labels feed the legend.
pdf1 = sb.lineplot(x=data, y=pdf, color='black', label='Normal pdf')
pdf1.axvline(4.5, label='4.5 ft')
pdf1.axvline(6.5, label='6.5 ft')
plt.xlabel('Heights')
plt.ylabel('Probability Density')
plt.legend()
plt.show()
OUTPUT
0.21185539858339675
0.673074931194895
0.11506967022170822
RESULT
The implementation of the concept of normal distribution in python has been executed successfully.
Ex.No:06b
IMPLEMENTATION OF CORRELATION
AIM
To write a Python program to implement the concept of correlation.
PROCEDURE
Correlation refers to a process for establishing the relationships between two variables.
To get a general idea about whether or not two variables are related, it is helpful to plot them on
a scatterplot.
Methods of correlation summarize the relationship between two variables in a single number called
the correlation coefficient. The correlation coefficient is usually represented by the symbol r, and
it ranges from -1 to +1.
A correlation coefficient quite close to 0, but either positive or negative, implies little or no
relationship between the two variables.
A correlation coefficient close to +1 indicates a positive relationship between the two variables, with
increases in one of the variables being associated with increases in the other variable.
A correlation coefficient close to -1 indicates a negative relationship between two variables, with an
increase in one of the variables being associated with a decrease in the other variable.
A correlation coefficient can be produced for ordinal, interval, or ratio level variables but has little
meaning for variables measured on a scale that is no more than nominal.
For ordinal scales, the correlation coefficient can be calculated using Spearman’s rho.
For interval or ratio level scales, the most commonly used correlation coefficient is Pearson's r,
ordinarily referred to as simply the correlation coefficient.
Pearson correlation is implemented by the stats.pearsonr() function.
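Pearson's r is the sum of the products of the deviations of x and y from their means, divided by the square root of the product of their sums of squared deviations. The sketch below computes r from that definition and compares it with `np.corrcoef`; the five data points are hypothetical.

```python
import numpy as np

# Hypothetical paired observations, used only for illustration
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

dx = x - x.mean()   # deviations of x from its mean
dy = y - y.mean()   # deviations of y from its mean

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
# Off-diagonal entry of the 2x2 correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print("r from the definition:", r_manual)
print("r from np.corrcoef:   ", r_numpy)
```

Both values agree, which shows that the library functions implement the same definition.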
PROGRAM
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import pandas as pd
experience = [1, 3, 4, 5, 5, 6, 7, 10, 11, 12, 15, 20, 25, 28, 30, 35]
salary = [20000, 30000, 40000, 45000, 55000, 60000, 80000, 100000, 130000, 150000,
          200000, 230000, 250000, 300000, 350000, 400000]
age = [20, 30, 36, 40, 28, 40, 50, 54, 49, 47, 45, 39, 29, 40, 30, 38]
df = pd.DataFrame({'expr': experience, 'sal': salary, 'age': age})
# Pearson correlation between experience and salary
corr = stats.pearsonr(experience, salary)[0]
print("The r value is:", corr)
if corr >= 0.7:
    print("Positively strong relation")
elif corr <= -0.7:
    print("Negatively strong relation")
else:
    print("No relation")
a = np.corrcoef(experience, salary)
plt.subplot(1, 2, 1)
plt.scatter(experience, salary, c="blue")
plt.xlabel("experience")
plt.ylabel("salary")
print("The correlation Matrix is:\n", a)
# Pearson correlation between experience and age
r = df['expr'].corr(df['age'])
print("The r value is:", r)
if r >= 0.7:
    print("Positively strong relation")
elif r <= -0.7:
    print("Negatively strong relation")
else:
    print("No relation")
plt.subplot(1, 2, 2)
plt.scatter(df.expr, df.age, c="red")
plt.xlabel("experience")
plt.ylabel("age")
plt.show()
OUTPUT
The r value is: 0.9929845761480397
Positively strong relation
The correlation Matrix is:
 [[1.         0.99298458]
 [0.99298458 1.        ]]
The r value is: 0.010682427800601435
No relation
RESULT
The implementation of the concept of correlation coefficient in python has been executed successfully.
Ex.No:07
IMPLEMENTATION OF LEAST SQUARE REGRESSION METHOD
AIM
To write a Python program to implement the concept of the least square regression
line method.
PROCEDURE
The least-squares method is a crucial statistical technique used to find a regression line or a best-fit
line for a given pattern. This method is described by an equation with specific parameters.
The method of least squares is widely used in evaluation and regression. In regression analysis, this
method is considered a standard approach for approximating sets of equations that have more
equations than unknowns.
The least-squares method defines the solution for minimizing the sum of squares of deviations or
errors in the results of each equation. The formula for the sum of squares of errors helps to find the
variation in observed data.
The least-squares method is often applied in data fitting.
The best-fit result aims to reduce the sum of squared errors or residuals, which are the differences
between the observed or experimental values and the corresponding fitted values given in the model.
We can derive the line of best fit using the formula:
y = ax + b
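For a straight line y = ax + b, minimizing the sum of squared residuals gives the closed-form solution a = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² for the slope and b = ȳ - a·x̄ for the intercept. The sketch below applies these formulas to the observations used in PROGRAM#1 of this exercise and cross-checks them against `np.polyfit`, which is used here only as an independent reference.

```python
import numpy as np

# Observations from PROGRAM#1 of this exercise
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

# Closed-form least-squares estimates for y = a*x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)  # slope
b = y.mean() - a * x.mean()                                              # intercept

# Independent check: degree-1 polynomial fit returns (slope, intercept)
a_ref, b_ref = np.polyfit(x, y, deg=1)

print("slope a =", a, " intercept b =", b)
print("polyfit  =", a_ref, b_ref)
```

In this notation a is the slope and b is the intercept; PROGRAM#1 stores the same two numbers as b[1] and b[0] respectively.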
PROGRAM#1
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # Calculating cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # Calculating regression coefficients (b_1: slope, b_0: intercept)
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1] * x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

# observations/data
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\na = {}\nb = {}".format(b[0], b[1]))
print("Least square regression equation is:", "{0:.3f}".format(b[0]), "+ x", "{0:.3f}".format(b[1]))
# plotting regression line
plot_regression_line(x, y, b)
OUTPUT
Estimated coefficients:
a = 1.2363636363636363
b = 1.1696969696969697
Least square regression equation is: 1.236 + x 1.170
PROGRAM#2
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# scikit-learn expects a 2-D input array, hence the reshape
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
plt.scatter(x, y, color="m", marker="o", s=30)
plt.plot(x, y_pred, color="green")
plt.xlabel('x')
plt.ylabel('y')
plt.show()
OUTPUT
coefficient of determination: 0.7158756137479542
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]
RESULT
The implementation of the concept of least square regression in python has been executed
successfully.
Ex.No:08
IMPLEMENTATION OF MULTIPLE REGRESSION
AIM
To write a Python program to implement the concept of multiple regression.
PROCEDURE
model=LinearRegression()
With .fit(), you calculate the optimal values of the weights (the intercept 𝑏₀ and one
coefficient 𝑏₁, 𝑏₂, ... per input column), using the existing input and output, x and y, as
the arguments.
In other words, .fit() fits the model. It returns self, which is the variable model itself.
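What .fit() computes can be sketched with plain NumPy as an ordinary least-squares problem. This is not scikit-learn's implementation; it uses `np.linalg.lstsq` as a stand-in, on the same data as the program of this exercise, to show where the weights and the coefficient of determination come from.

```python
import numpy as np

# Same data as in the program of this exercise: two input columns, one output
x = np.array([[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]],
             dtype=float)
y = np.array([4, 5, 20, 14, 32, 22, 38, 43], dtype=float)

# Prepend a column of ones so that the first weight acts as the intercept b0
X = np.column_stack([np.ones(len(x)), x])

# Solve min ||X w - y||^2 for w = (b0, b1, b2)
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = weights

# Predictions and coefficient of determination R^2 = 1 - SS_res / SS_tot
y_pred = X @ weights
ss_res = np.sum((y - y_pred)**2)
ss_tot = np.sum((y - y.mean())**2)
r_sq = 1 - ss_res / ss_tot

print("intercept b0:", b0)
print("coefficients b1, b2:", b1, b2)
print("coefficient of determination:", r_sq)
```

The prediction for each row is b0 + b1*x1 + b2*x2, which is the same expression the program evaluates with `model.intercept_` and `model.coef_`.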
PROGRAM
import numpy as np
from sklearn.linear_model import LinearRegression
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x, y = np.array(x), np.array(y)
model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
y_pred = model.predict(x)
# The same predictions computed by hand from the fitted weights
y_pred = model.intercept_ + np.sum(model.coef_ * x, axis=1)
print(f"predicted response:\n{y_pred}")
OUTPUT
coefficient of determination: 0.8615939258756776
predicted response:
[ 5.77760476  8.012953   12.73867497 17.9744479  23.97529728 29.4660957
 38.78227633 41.27265006]
RESULT
The implementation of the concept of multiple regression in python has been executed
successfully.
Ex.No:09
IMPLEMENTATION OF Z TEST
AIM
To write a Python program to implement the concept of the z-test.
PROCEDURE
A Z-test is a statistical test used to determine whether two population means are different when the
variances are known and the sample size is large.
The test statistic is assumed to have a normal distribution, and nuisance parameters such as standard
deviation should be known in order for an accurate Z-test to be performed.
You can use the ztest() function from the statsmodels package to perform one-sample and two-sample
Z-tests in Python.
statsmodels.stats.weightstats.ztest(x1, x2=None, value=0)
where:
x1: values for the first sample
x2: values for the second sample (if performing a two-sample z-test)
value: mean under the null (in the one-sample case) or mean difference (in the two-sample
case)
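The one-sample z statistic is z = (sample mean - null mean) / (s / √n), and the one-sided p-value for the alternative "mean is larger" is the upper-tail probability P(Z > z) of the standard normal distribution. The sketch below computes both by hand; the data are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption about how ztest() estimates the standard error by default.

```python
import math
import numpy as np

# Hypothetical sample, used only for illustration
data = np.array([108.0, 112.5, 109.1, 111.3, 107.8, 110.4, 113.2, 109.9])
null_mean = 100

n = data.size
std_err = data.std(ddof=1) / math.sqrt(n)     # estimated standard error of the mean
z = (data.mean() - null_mean) / std_err       # z statistic

# Upper-tail probability of the standard normal: P(Z > z) = 0.5 * erfc(z / sqrt(2))
p_larger = 0.5 * math.erfc(z / math.sqrt(2))

print("z =", z)
print("one-sided p-value =", p_larger)
```

A p-value below the chosen significance level (for example 0.05) leads to rejecting the null hypothesis that the population mean equals 100.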
PROGRAM
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest
# Generate a random array of 50 numbers having mean 110 and sd 15/sqrt(50),
# similar to the IQ scores data we assume above
mean_iq = 110
sd_iq = 15 / math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq * randn(50) + mean_iq
print(data)
# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))
# Now we perform the test. We pass the data, the mean under the null hypothesis as the
# value parameter, and alternative='larger' to check whether the mean is larger.
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')
# The function returns the z-score and the corresponding p-value. We compare the
# p-value with alpha: if it is greater than alpha we do not reject the null
# hypothesis, else we reject it.
print(ztest_Score)
print(p_value)
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Retain NULL Hypothesis")
OUTPUT
[109.32317127 107.5180639  110.34687099 109.24347939 110.53062543
 107.52136918 108.82945441 111.76442439 109.52318027 110.82466106
 115.26944919 107.92161923 107.72658274 110.41249653 108.29400528
 109.14010895 108.72487358 109.66275854 111.32339348 109.04384837
 109.86205757 107.5119657  110.81380703 112.01224839 110.78607939
 109.92631499 111.87247694 111.44013795 107.80153879 110.97192483
 112.59626892 105.76061577 106.49222008 112.77586897 108.94970948
 107.82422582 105.03208528 106.02078487 109.01139522 110.93278094
 108.19795152 107.90859201 107.45543581 106.06955274 110.7309573
 110.0331299  107.9865958  109.69378913 111.6672574  106.72201292]
mean=109.36 stdv=2.04
32.152747672088914
4.0423380710270466e-227
Reject Null Hypothesis
RESULT
The implementation of the concept of z-test in python has been executed successfully.
Ex.No:10
IMPLEMENTATION OF T TEST
AIM
To write a Python program to implement the concept of the t-test.
PROCEDURE
A t-test is a statistical test that is used to compare the means of two groups.
One-sample t-test, scipy.stats.ttest_1samp():
This is a test for the null hypothesis that the expected value (mean) of a sample of
independent observations a is equal to the given population mean, popmean.
Returns: the t-statistic, p-value, df value, CI value
Two independent samples t-test, scipy.stats.ttest_ind():
This is a test for the null hypothesis that 2 independent samples have identical
average (expected) values. This test assumes that the populations have identical variances by
default.
Returns: the t-statistic, p-value
Two related samples t-test, scipy.stats.ttest_rel():
This is a test for the null hypothesis that two related or repeated samples have identical
average (expected) values.
Returns: the t-statistic, p-value, df value, CI value
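The three tests described above correspond to three SciPy calls. The sketch below uses the blood-pressure and test-marks data from the programs of this exercise. It assumes a SciPy version that supports the `alternative` argument, which gives the one-tailed p-value directly instead of halving the two-tailed one. Treating the two mark lists as independent in the last call is done only to show the call, not because it suits these paired data.

```python
from scipy import stats

# One-sample test: is the mean systolic blood pressure less than 165?
sys_bp = [183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]
t_one, p_one = stats.ttest_1samp(sys_bp, popmean=165, alternative="less")

# Related (paired) samples: marks of the same students before and after tuition
first_test = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
second_test = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 18]
t_rel, p_rel = stats.ttest_rel(first_test, second_test, alternative="less")

# Independent samples with equal variances assumed (illustration of the call only)
t_ind, p_ind = stats.ttest_ind(first_test, second_test, equal_var=True)

print("one-sample:  t = %.6f, one-tailed p = %.6f" % (t_one, p_one))
print("paired:      t = %.6f, one-tailed p = %.6f" % (t_rel, p_rel))
print("independent: t = %.6f, two-tailed p = %.6f" % (t_ind, p_ind))
```

Setting `equal_var=False` in `ttest_ind` switches to Welch's t-test, which does not assume identical population variances.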
PROGRAM#1
One sample t-test
Data:
Systolic blood pressures of 14 patients are given below:
183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138
Test whether the population mean is less than 165.
Hypothesis
H0: There is no significant mean difference in systolic blood pressure, i.e., μ = 165
H1: The population mean is less than 165, i.e., μ < 165
Code:
from scipy import stats
sys_bp = [183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]
mu = 165
t_value, p_value = stats.ttest_1samp(sys_bp, mu)
# Halving the two-tailed p-value gives the one-tailed p-value for H1: μ < 165
# only when the test statistic is negative, as it is here.
one_tailed_p_value = float("{:.6f}".format(p_value / 2))
print('Test statistic is %f' % float("{:.6f}".format(t_value)))
print('p-value for one tailed test is %f' % one_tailed_p_value)
alpha = 0.05
if one_tailed_p_value <= alpha:
    print('Conclusion', '\n', 'Since p-value(=%f)' % one_tailed_p_value, '<', 'alpha(=%.2f)' % alpha,
          '''We reject the null hypothesis H0. So we conclude that the population mean of systolic
blood pressure is less than 165, i.e., μ < 165 at %.2f level of significance.''' % alpha)
else:
    print('Conclusion', 'Since p-value(=%f)' % one_tailed_p_value, '>', 'alpha(=%.2f)' % alpha,
          'We do not reject the null hypothesis H0.')
OUTPUT
Test statistic is -1.243183
p-value for one tailed test is 0.117877
Conclusion Since p-value(=0.117877) > alpha(=0.05) We do not reject the null hypothesis H0.
PROGRAM#2
# NOTE: the Ammonium_chloride sample is missing from this listing; define it as a list
# of grain yields before running the code.
Urea = [12, 11.7, 10.7, 11.2, 14.8, 14.4, 13.9, 13.7, 16.9, 16, 15.6, 16]
from scipy import stats
t_value, p_value = stats.ttest_ind(Ammonium_chloride, Urea)
print('Test statistic is %f' % float("{:.6f}".format(t_value)))
print('p-value for two tailed test is %f' % p_value)
alpha = 0.05
if p_value <= alpha:
    print('Conclusion', '\n', 'Since p-value(=%f)' % p_value, '<', 'alpha(=%.2f)' % alpha,
          '''We reject the null hypothesis H0. So we conclude that the effects of ammonium chloride
and urea on grain yield of paddy are not equal, i.e., μ1 ≠ μ2 at %.2f level of significance.''' % alpha)
else:
    print('Conclusion', '\n', 'Since p-value(=%f)' % p_value, '>', 'alpha(=%.2f)' % alpha,
          'We do not reject the null hypothesis H0.')
OUTPUT
Test statistic is 0.184650
p-value for two tailed test is 0.855195
Conclusion
 Since p-value(=0.855195) > alpha(=0.05) We do not reject the null hypothesis H0.
PROGRAM#3
Two related samples t-test:
Eleven schoolboys were given a test in Statistics. They were given a month's tuition and a second test was held
at the end of it. Do the marks give evidence that the students have benefited from the exam coaching?
Marks in 1st test: 23 20 19 21 18 20 18 17 23 16 19
Marks in 2nd test: 24 19 22 18 20 22 20 20 23 20 18
Hypothesis
H0: The students have not benefited from the tuition class, i.e., d = 0
H1: The students have benefited from the tuition class, i.e., d < 0
Where d = x - y; d is the difference between marks in the first test (say x) and marks in the second test (say y).
Code
alpha = 0.05
first_test = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
second_test = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 18]
from scipy import stats
t_value, p_value = stats.ttest_rel(first_test, second_test)
# Halving the two-tailed p-value is valid for H1: d < 0 because t is negative here.
one_tailed_p_value = float("{:.6f}".format(p_value / 2))
print('Test statistic is %f' % float("{:.6f}".format(t_value)))
print('p-value for one_tailed_test is %f' % one_tailed_p_value)
if one_tailed_p_value <= alpha:
    print('Conclusion', '\n', 'Since p-value(=%f)' % one_tailed_p_value, '<', 'alpha(=%.2f)' % alpha,
          '''We reject the null hypothesis H0. So we conclude that the students have benefited by the
tuition class, i.e., d < 0 at %.2f level of significance.''' % alpha)
else:
    print('Conclusion', '\n', 'Since p-value(=%f)' % one_tailed_p_value, '>', 'alpha(=%.2f)' % alpha,
          '''We do not reject the null hypothesis H0. So we conclude that the students have not
benefited by the tuition class, i.e., d = 0 at %.2f level of significance.''' % alpha)
OUTPUT
Test statistic is -1.707331
p-value for one_tailed_test is 0.059282
Conclusion
 Since p-value(=0.059282) > alpha(=0.05) We do not reject the null hypothesis H0. So we
conclude that the students have not benefited by the tuition class, i.e., d = 0 at 0.05
level of significance.
RESULT
The implementation of the concept of various type of t test in python has been executed successfully.
Ex.No:11
IMPLEMENTATION OF ANOVA
AIM
To write a Python program to implement the concept of ANOVA and the F-test.
PROCEDURE
An ANOVA test is a statistical test used to determine whether there is a statistically significant
difference between the means of two or more categorical groups, by comparing the variance
between the groups with the variance within the groups.
Another key part of ANOVA is that it splits the independent variable into two or more
groups.
It is implemented by the function f_oneway().
scipy.stats.f_oneway(*samples, axis=0)
Perform one-way ANOVA.
The one-way ANOVA tests the null hypothesis that two or more groups have the same
population mean. The test is applied to samples from two or more groups, possibly with
differing sizes.
Parameters:
sample1, sample2, …: array_like
The sample measurements for each group. There must be at least two arguments. If the arrays
are multidimensional, then all the dimensions of the array must be the same except for axis.
axis: int, optional
Axis of the input arrays along which the test is applied. Default is 0.
Returns: the F statistic value and the p-value
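The F statistic is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand with NumPy, using the engine-oil performance data from PROGRAM#1 of this exercise, so the value can be compared with the one reported by f_oneway().

```python
import numpy as np

# Performance data for the four engine oils from PROGRAM#1
groups = [
    np.array([89, 89, 88, 78, 79], dtype=float),
    np.array([93, 92, 94, 89, 88], dtype=float),
    np.array([89, 88, 89, 93, 90], dtype=float),
    np.array([81, 78, 81, 92, 82], dtype=float),
]

k = len(groups)                              # number of groups
n_total = sum(g.size for g in groups)        # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sum of squares between groups and within groups
ss_between = sum(g.size * (g.mean() - grand_mean)**2 for g in groups)
ss_within = sum(np.sum((g - g.mean())**2) for g in groups)

# Mean squares use k - 1 and n_total - k degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within

print("F statistic:", f_stat)
```

A large F value means the group means differ by more than the spread inside the groups would explain; the p-value then comes from the F distribution with (k - 1, n_total - k) degrees of freedom.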
PROGRAM#1
# Importing library
from scipy.stats import f_oneway
# Performance when each of the engine
# oils is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]
# Conduct the one-way ANOVA
f, p_value = f_oneway(performance1, performance2, performance3, performance4)
print(f, p_value)
OUTPUT
4.625000000000002 0.016336459839780215
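The F value printed above can be reproduced by hand, which may help to show what f_oneway() computes. The following is a minimal sketch using the same four engine-oil samples: F is the between-group mean square divided by the within-group mean square. The variable names (ss_between, ss_within, and so on) are illustrative choices, not part of SciPy.

```python
import numpy as np
from scipy.stats import f as f_dist, f_oneway

# The four samples from PROGRAM#1
groups = [
    [89, 89, 88, 78, 79],
    [93, 92, 94, 89, 88],
    [89, 88, 89, 93, 90],
    [81, 78, 81, 92, 82],
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.mean(np.concatenate(groups))

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)

# Mean squares and the F ratio
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# p-value: upper tail of the F distribution with (k-1, N-k) degrees of freedom
p_manual = f_dist.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = f_oneway(*groups)
print(f_manual, p_manual)
print(f_scipy, p_scipy)
```

Here the between-group sum of squares is 244.2 with 3 degrees of freedom and the within-group sum of squares is 281.6 with 16 degrees of freedom, so F = 81.4 / 17.6 = 4.625.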
PROGRAM#2
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

subs_list = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10']
task_list = ['task1', 'task2', 'task3']
condition_list = ['pre', 'post']
# Read data into a dataframe
df_2way_rm = pd.DataFrame(columns=["sub_id", "task", "condition", "my_value"])
my_row = 0
# Unique subject ID as an additional factor
sub_id = 0
for sub in subs_list:
    sub_id = sub_id + 1
    for ind_t, task in enumerate(task_list):
        for ind_c, con in enumerate(condition_list):
            # Generate a random value here as an example
            my_val = np.random.normal(ind_t + ind_c, 1, 1)[0]
            df_2way_rm.loc[my_row] = [sub_id, task, con, my_val]
            my_row = my_row + 1
# Make sure the dependent variable is stored as a numeric column
df_2way_rm['my_value'] = df_2way_rm['my_value'].astype(float)
print(df_2way_rm)
# Conduct the two-way repeated-measures ANOVA using AnovaRM
my_model_fit = AnovaRM(df_2way_rm, 'my_value', 'sub_id', within=['task', 'condition']).fit()
print(my_model_fit.anova_table)
OUTPUT
                  F Value  Num DF  Den DF    Pr > F
task            24.844085     2.0    18.0  0.000007
condition        7.293545     1.0     9.0  0.024367
task:condition   0.774146     2.0    18.0  0.4758
RESULT
Thus the concept of ANOVA was implemented in Python and executed successfully.
Ex.No:12
IMPLEMENTATION OF LOGISTIC REGRESSION
AIM
To use a logistic regression model to predict whether a person would buy life insurance based on their age.
PROCEDURE
Logistic regression is one of the most popular machine learning algorithms and belongs to the
supervised learning techniques.
It is used for predicting a categorical dependent variable from a given set of independent variables,
so the outcome must be a categorical or discrete value.
The outcome can be Yes or No, 0 or 1, True or False, etc. Instead of giving the exact values 0 and 1,
the model gives probabilistic values that lie between 0 and 1.
Logistic regression is similar to linear regression except in how it is used: linear regression is
used for solving regression problems, whereas logistic regression is used for solving classification
problems.
In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function
whose output is bounded between 0 and 1.
Logistic regression can be used to classify observations using different types of data and can easily
determine the most effective variables used for classification.
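The "S"-shaped logistic function described above can be sketched in a few lines. This is a minimal illustration, not the fitted model from this experiment: the values of coef and intercept are hypothetical, chosen only to show how a linear score on age is squashed into a probability and then thresholded at 0.5.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficient and intercept for a one-feature model (age)
coef, intercept = 0.12, -4.5

ages = np.array([20, 30, 40, 50, 60])
probabilities = sigmoid(coef * ages + intercept)

# Classify as 1 (buys insurance) when the probability is at least 0.5
labels = (probabilities >= 0.5).astype(int)

for age, p, label in zip(ages, probabilities, labels):
    print(age, round(float(p), 3), label)
```

With these assumed parameters the linear score crosses zero at age 37.5, so ages 20 and 30 are classified as 0 and ages 40, 50 and 60 as 1.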
PROGRAM
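import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the dataset (columns used: age, bought_insurance)
df = pd.read_csv("insurance_data.csv")
print(df.head())
# Visualize the purchase decision against age
plt.scatter(df.age, df.bought_insurance, marker='+', color='red')
plt.show()
# Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(df[['age']], df.bought_insurance, train_size=0.8)
# Fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
print(X_test)
# Predict class labels and class probabilities for the test set
y_predicted = model.predict(X_test)
print(y_predicted)
print(model.predict_proba(X_test))
# Calculate accuracy on the test set
print(model.score(X_test, y_test))
# Learned coefficient and intercept
print(model.coef_)
print(model.intercept_)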
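The learned coef_ and intercept_ are what the model feeds into the logistic function to obtain probabilities. The sketch below checks this on a small synthetic dataset; the ages and purchase flags are hypothetical stand-ins for insurance_data.csv, which is not reproduced here, so the printed numbers are not results of this experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for insurance_data.csv: age and a 0/1 purchase flag
ages = np.array([[22], [25], [47], [52], [46], [56], [55], [60], [62], [61],
                 [18], [28], [27], [29], [49], [58], [19], [21], [26], [45]])
bought = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1,
                   0, 0, 0, 0, 1, 1, 0, 0, 0, 1])

model = LogisticRegression()
model.fit(ages, bought)

# Probability of class 1 computed by hand: sigmoid(coef * age + intercept)
new_ages = np.array([[20], [40], [65]])
z = model.coef_[0, 0] * new_ages[:, 0] + model.intercept_[0]
manual_prob = 1.0 / (1.0 + np.exp(-z))

# The same probability as reported by scikit-learn (second column = class 1)
sklearn_prob = model.predict_proba(new_ages)[:, 1]

print(manual_prob)
print(sklearn_prob)
```

Note that predict() expects a two-dimensional input, so a single age is passed as [[20]] rather than 20.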
RESULT
Thus a logistic regression model for predicting whether a person would buy life insurance based on their age was implemented successfully.
Ex.No:13
IMPLEMENTATION OF TIME SERIES ANALYSIS
AIM
To perform Time Series Analysis of Open Power System Data.
PROCEDURE
Time series is a series of data points in which each data point is associated with a timestamp.
A simple example is the price of a stock in the stock market at different points of time on a given day.
Another example is the amount of rainfall in a region at different months of the year.
A time series is a set of observations collected at regular intervals of time. If plotted, the
time series always has time as one of its axes.
Time series analysis assumes that data collected over time may have internal structure (such as
trend or seasonality), and it analyzes the data to extract these characteristics.
Time is the independent variable, and the target variable is predicted from it.
Time Series Analysis (TSA) is used in different fields for time-based predictions, such as:
Weather forecasting models
Stock market predictions
Signal processing
Engineering domain – Control Systems and Communications Systems
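The idea that every observation carries a timestamp can be sketched with a pandas Series on a DatetimeIndex. This is an illustrative sketch only: the dates and the values 1 to 14 are hypothetical and are unrelated to the Open Power System Data used in the program. It shows timestamp components, selection by date range, and resampling to weekly means.

```python
import pandas as pd

# Hypothetical daily readings for two full weeks, starting Monday 2024-01-01
dates = pd.date_range("2024-01-01", periods=14, freq="D")
series = pd.Series(range(1, 15), index=dates, name="reading", dtype="float64")

# Components of each timestamp are available from the DatetimeIndex
first_days = series.index.day_name()[:3].tolist()
print(first_days)

# Label-based slicing selects a date range (both end points are included)
first_week = series.loc["2024-01-01":"2024-01-07"]
print(first_week.sum())

# Resample to weekly frequency ('W' bins end on Sunday) and take the mean
weekly_mean = series.resample("W").mean()
print(weekly_mean)
```

Each weekly bin here covers Monday to Sunday, so the two weekly means are the averages of 1 to 7 and of 8 to 14.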
PROGRAM
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
# Load the time series dataset
df_power = pd.read_csv("https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv")
print(df_power.columns)
print(df_power.shape)
print(df_power.tail(10))
print(df_power.dtypes)
# Convert the Date column from object to datetime format
df_power['Date'] = pd.to_datetime(df_power['Date'])
print(df_power.dtypes)
df_power = df_power.set_index('Date')
print(df_power.tail(3))
print(df_power.index)
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month_name()
df_power['Weekday_Name'] = df_power.index.day_name()
# Look up the record for a single day
print(df_power.loc['2015-10-02'])
# Let's generate a line plot of the full time series of Germany's daily electricity consumption
sns.set(rc={'figure.figsize': (11, 4)})
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 100
df_power['Consumption'].plot(linewidth=0.5)
plt.show()
# Let's use dots to plot the data for all the other columns
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5, linestyle='None', figsize=(14, 6), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')
plt.show()
# We can further investigate a single year to have a closer look
ax = df_power.loc['2016', 'Consumption'].plot()
ax.set_ylabel('Daily Consumption (GWh)')
plt.show()
# Let's examine the month of December 2016
ax = df_power.loc['2016-12', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)')
plt.show()
# To show power consumption in a particular week of December, we can supply a specific date range
ax = df_power.loc['2016-12-23':'2016-12-30', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)')
plt.show()
# We can first group the data by months and then use box plots to visualize the data:
fig, axes = plt.subplots(3, 1, figsize=(8, 7), sharex=True)
for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=df_power, x='Month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    if ax != axes[-1]:
        ax.set_xlabel('')
plt.show()
# We can group the consumption of electricity by the day of the week, and present it in a box plot:
sns.boxplot(data=df_power, x='Weekday_Name', y='Consumption')
plt.show()
# We can use the code given here to resample our data to weekly means:
columns = ['Consumption', 'Wind', 'Solar', 'Wind+Solar']
power_weekly_mean = df_power[columns].resample('W').mean()
print(power_weekly_mean)
# First six months of 2016
start, end = '2016-01', '2016-06'
fig, ax = plt.subplots()
ax.plot(df_power.loc[start:end, 'Solar'], marker='.', linestyle='-', linewidth=0.5, label='Daily')
ax.plot(power_weekly_mean.loc[start:end, 'Solar'], marker='o', markersize=8, linestyle='-', label='Weekly Mean Resample')
ax.set_ylabel('Solar Production (GWh)')
ax.legend()
plt.show()
OUTPUT
Index(['Date', 'Consumption', 'Wind', 'Solar', 'Wind+Solar'], dtype='object')
(4383, 5)
            Date  Consumption     Wind   Solar  Wind+Solar
4373  2017-12-22   1423.23782  228.773  10.065     238.838
4374  2017-12-23   1272.17085  748.074   8.450     756.524
4375  2017-12-24   1141.75730  812.422   9.949     822.371
4376  2017-12-25   1111.28338  587.810  15.765     603.575
4377  2017-12-26   1130.11683  717.453  30.923     748.376
4378  2017-12-27   1263.94091  394.507  16.530     411.037
4379  2017-12-28   1299.86398  506.424  14.162     520.586
4380  2017-12-29   1295.08753  584.277  29.854     614.131
4381  2017-12-30   1215.44897  721.247   7.467     728.714
4382  2017-12-31   1107.11488  721.176  19.980     741.156
Date            object
Consumption    float64
Wind           float64
Solar          float64
Wind+Solar     float64
dtype: object
Date           datetime64[ns]
Consumption           float64
Wind                  float64
Solar                 float64
Wind+Solar            float64
dtype: object
            Consumption     Wind   Solar  Wind+Solar
Date
2017-12-29   1295.08753  584.277  29.854     614.131
2017-12-30   1215.44897  721.247   7.467     728.714
2017-12-31   1107.11488  721.176  19.980     741.156
DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03', '2006-01-04',
               '2006-01-05', '2006-01-06', '2006-01-07', '2006-01-08',
               '2006-01-09', '2006-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', name='Date', length=4383, freq=None)
            Consumption    Wind    Solar  ...  Year    Month  Weekday_Name
Date                                      ...
2008-08-23     1152.011     NaN      NaN  ...  2008   August      Saturday
2013-08-08     1291.984  79.666   93.371  ...  2013   August      Thursday
2009-08-27     1281.057     NaN      NaN  ...  2009   August      Thursday
2015-10-02     1391.050  81.229  160.641  ...  2015  October        Friday
2009-06-02     1201.522     NaN      NaN  ...  2009     June       Tuesday

[5 rows x 7 columns]
Consumption     1391.05
Wind             81.229
Solar           160.641
Wind+Solar       241.87
Year               2015
Month           October
Weekday_Name     Friday
Name: 2015-10-02 00:00:00, dtype: object
RESULT