Fdsa Lab Manual Final

The document is a laboratory manual for the Data Science and Analytics course at Indra Ganesan College of Engineering, detailing course objectives, experiments, and outcomes. It covers practical applications using Python libraries such as Numpy and Pandas for data manipulation, visualization, and statistical analysis. The manual includes specific programming examples and procedures for various data science tasks.

Uploaded by

mohanprabhuaids
INDRA GANESAN COLLEGE OF ENGINEERING

TRICHY- 12

Department of Artificial Intelligence and Data Science

B.E.(EEE)

B.TECH ARTIFICIAL INTELLIGENCE AND DATA SCIENCE


IV SEMESTER

REGULATION R-2021

AD3411 DATA SCIENCE AND ANALYTICS LABORATORY


LABORATORY MANUAL

Prepared by                    Approved by                    Principal

AD3411 DATA SCIENCE AND ANALYTICS LABORATORY                  L T P C
                                                              0 0 4 2
COURSE OBJECTIVES:
• To develop data analytic code in Python
• To be able to use Python libraries for handling data
• To develop analytical applications using Python
• To perform data visualization using plots

LIST OF EXPERIMENTS
Tools: Python, Numpy, Scipy, Matplotlib, Pandas, statsmodels, seaborn, plotly, bokeh

1. Working with Numpy arrays
2. Working with Pandas data frames
3. Basic plots using Matplotlib
4. Frequency distributions, Averages, Variability
5. Normal curves, Correlation and scatter plots, Correlation coefficient
6. Regression
7. Z-test
8. T-test
9. ANOVA
10. Building and validating linear models
11. Building and validating logistic models
12. Time series analysis

COURSE OUTCOMES:

Upon successful completion of this course, students will be able to:

CO    DESCRIPTION                                                    EXP. NO.
1.    Write Python programs to handle data using Numpy and Pandas    1, 2
2.    Perform descriptive analytics                                  3, 4
3.    Perform data exploration using Matplotlib                      5, 6
4.    Perform inferential data analytics                             7, 8
5.    Build models of predictive analytics                           9, 10, 11, 12

CONTENTS

S.NO   Title of the Experiment                                                  Remarks

1      Working with Numpy arrays
2      Working with Pandas dataframes
3      Basic plots using Matplotlib
4      Frequency distributions, Averages, Variability
5      Normal curves, Correlation and scatter plots, Correlation coefficient
6      Regression
7      Z-test
8      T-test
9      ANOVA
10     Building and validating linear models
11     Building and validating logistic models
12     Time series analysis

Ex.No:01
BASIC PLOTS USING MATPLOTLIB

AIM
To write a program to draw various plots using Matplotlib.

PROCEDURE

• Matplotlib is a visualization library in Python for 2D plots of arrays.
• Matplotlib is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack.
• Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported under the plt alias:
import matplotlib.pyplot as plt
Now the pyplot package can be referred to as plt.
• The programs below cover four commonly used plots: line graph, bar chart, scatter plot and histogram.

PROGRAM#1
# Line graph example
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y1 = [1, 3, 5, 3, 1, 3, 5, 3, 1]
y2 = [2, 4, 6, 4, 2, 4, 6, 4, 2]
plt.plot(x, y1, label="line L")
plt.plot(x, y2, label="line H")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.title("Line Graph Example")
plt.legend()
plt.show()

PROGRAM#2
# Bar chart example
import matplotlib.pyplot as plt
# x = 4 and x = 6 appear in both series, so those bars overlap.
x1 = [1, 3, 4, 5, 6, 7, 9]
y1 = [4, 7, 2, 4, 7, 8, 3]
x2 = [2, 4, 6, 8, 10]
y2 = [5, 6, 2, 6, 2]
plt.bar(x1, y1, label="Blue Bar", color='b')
plt.bar(x2, y2, label="Green Bar", color='g')
plt.xlabel("bar number")
plt.ylabel("bar height")
plt.title("Bar Chart Example")
plt.legend()
plt.show()

PROGRAM#3
# Scatter plot example
import matplotlib.pyplot as plt
x1 = [2, 3, 4]
y1 = [5, 5, 5]
x2 = [1, 2, 3, 4, 5]
y2 = [2, 3, 2, 3, 4]
y3 = [6, 8, 7, 8, 7]
plt.scatter(x1, y1)
plt.scatter(x2, y2, marker='v', color='r')
plt.scatter(x2, y3, marker='^', color='m')
plt.title('Scatter Plot Example')
plt.show()

PROGRAM#4
# Histogram example
import matplotlib.pyplot as plt
import numpy as np
# Creating dataset
a = np.array([61, 63, 64, 66, 68, 69, 69.5, 70, 72,
              72.5, 73, 73.5, 74, 74.5, 76, 76.2,
              76.5, 77, 77.5, 78, 78.5, 79, 79.2,
              80, 81, 82, 83, 84, 85, 87])
# Creating histogram
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(a, bins=[60, 65, 70, 75, 80, 85, 90])
# Show plot
plt.show()

RESULT
Thus the implementation of various plots has been executed successfully.

Ex.No:02
WORKING WITH NUMPY ARRAYS

AIM
To write a program in Python to perform multi-dimensional array manipulation using Numpy.

PROCEDURE
What is NumPy?
• NumPy is a Python library used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely.
• NumPy stands for Numerical Python.

Why Use NumPy?
• In Python we have lists that serve the purpose of arrays, but they are slow to process.
• NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
• The array object in NumPy is called ndarray; it provides a lot of supporting functions that make working with ndarray very easy.
• Arrays are very frequently used in data science, where speed and resources are very important.

Why is NumPy Faster Than Lists?
• NumPy arrays are stored in one contiguous block of memory, unlike lists, so processes can access and manipulate them very efficiently.
• This behavior is called locality of reference in computer science.
• This is the main reason why NumPy is faster than lists. It is also optimized to work with the latest CPU architectures.
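
The speed difference described above can be observed with a small timing sketch. The array size and the doubling operation are assumptions chosen only for illustration, and the measured times depend on the machine, so no particular speed-up factor is claimed.

```python
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n)

# Double every element with a plain Python list comprehension
start = time.perf_counter()
doubled_list = [v * 2 for v in py_list]
list_seconds = time.perf_counter() - start

# Double every element with one vectorized NumPy operation
start = time.perf_counter()
doubled_array = np_array * 2
array_seconds = time.perf_counter() - start

print(f"list: {list_seconds:.4f} s, ndarray: {array_seconds:.4f} s")
# An array created this way occupies one contiguous block of memory
print(np_array.flags['C_CONTIGUOUS'])
```

Both approaches produce the same values; only the time taken differs.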

PROGRAM
(2.1). Matrix Addition
Program:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[10, 11, 12], [13, 14, 15]])
c = a + b
print("a=", a)
print("b=", b)
print("Addition of a and b = ", c)

Output:

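(2.2). Matrix Multiplication (element-wise)
Program:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[10, 11, 12], [13, 14, 15]])
# The * operator multiplies the arrays element by element;
# it is not the linear-algebra matrix product.
c = a * b
print("a=", a)
print("b=", b)
print("Multiplication of a and b = ", c)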
Output:
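
As a side note to program (2.2), the sketch below contrasts the element-wise product with the linear-algebra matrix product. Because both arrays have shape (2, 3), the matrix product here uses the transpose of b; that choice is an assumption made only so that the shapes are compatible.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
b = np.array([[10, 11, 12], [13, 14, 15]])  # shape (2, 3)

elementwise = a * b        # multiplies matching entries, shape (2, 3)
matrix_product = a @ b.T   # rows of a times columns of b.T, shape (2, 2)

print(elementwise)         # [[10 22 36] [52 70 90]]
print(matrix_product)      # [[ 68  86] [167 212]]
```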

(2.3). Scalar Multiplication of Matrix
Program:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = a * 3
print("a=", a)
print("b=a*3=", b)
Output:

(2.4). Matrix Transpose
Program:
import numpy as np
a=np.array([[1,2,3],[4,5,6],[7,8,9]])
b=a.T
print("a = \n", a)
print("Transpose of a=\n",b)
Output:

(2.5). Array Datatype Conversion
Program:
import numpy as np
a = np.array([[2.5, 3.8, 1.5], [4.7, 2.9, 1.56]])
b = a.astype('int')
print("The array in float datatype=\n", a)
print("The array in int datatype=\n", b)
Output:

(2.6). Stacking of numpy arrays
Program:
import numpy as np
a1 = np.array([[1, 2, 3], [4, 5, 6]])
a2 = np.array([[7, 8, 9], [10, 11, 12]])
c = np.hstack((a1, a2))
d = np.vstack((a1, a2))
print("The two arrays are:\na1=\n", a1, "\na2=\n", a2)
print("\nHorizontal stacking:\n", c)
print("\nVertical stacking:\n", d)
Output:

(2.7). Sequence generation
Program:
import numpy as np
lists = [x for x in range(0, 101, 2)]
a = np.array(lists)
print(a)
Output:

(2.8). Sorting an array
Program:
import numpy as np
a = np.array([[1, 4, 2], [3, 4, 6], [0, -1, 5]])
print("Array before sorting")
print(a)
print("Flattened and sorted :")
print(np.sort(a, axis=None))
print("Sorting in row wise :")
print(np.sort(a, axis=1))
print("Sorting in column wise :")
print(np.sort(a, axis=0))
Output:

RESULT

The various matrix manipulation operations using the numpy module in Python have been implemented and executed successfully.

Ex.No:03
WORKING WITH PANDAS DATAFRAMES

AIM
To perform various operations on dataframe using pandas module in python.

PROCEDURE
What is Pandas?
• Pandas is a Python library used for working with datasets.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name "Pandas" has a reference to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.

Why Use Pandas?
• Pandas allows us to analyze big data and make conclusions based on statistical theories.
• Pandas can clean messy datasets, and make them readable and relevant.
• Relevant data is very important in data science.

What Can Pandas Do?
• Pandas gives you answers about the data, such as:
• Is there a correlation between two or more columns?
• What is the average value?
• Max value?
• Min value?
• Pandas is also able to delete rows that are not relevant, or that contain wrong values, like empty or NULL values. This is called cleaning the data.
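
The questions listed above map onto a handful of pandas calls. The following is a minimal sketch; the column names and values are hypothetical and were invented only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this illustration only
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, np.nan],
    "score": [52.0, 58.0, 65.0, 71.0, 80.0],
})

clean = df.dropna()                          # cleaning: drop rows with empty (NaN) values
print(clean["hours"].corr(clean["score"]))   # correlation between two columns
print(clean["score"].mean())                 # average value
print(clean["score"].max())                  # max value
print(clean["score"].min())                  # min value
```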

PROGRAM
(3.1). Creating a dataframe
import pandas as pd
data = [['name1', 21, '[email protected]', 1234567891],
        ['name2', 26, '[email protected]', 1234567892]]
df = pd.DataFrame(data, columns=['NAME', 'AGE', 'EMAIL ID', 'PHONE NUMBER'], index=[1, 2])
print(df)

Output:

(3.2). Creating a dataframe using a dictionary
import pandas as pd
data = {'Name': ['aa', 'bb', 'cc'], 'Age': [20, 21, 25]}
df = pd.DataFrame(data)
print(df)

Output:

(3.3). Creating a dataframe from a series
import pandas as pd
data = {'ONE': pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4]),
        'TWO': pd.Series([50, 60, 70, 80], index=[1, 2, 3, 4])}
df = pd.DataFrame(data)
print(df)
Output:

(3.4). Sorting the dataframe
import pandas as pd
data = {'Name': ['name1', 'name2', 'name3'], 'Age': [20, 21, 22]}
df = pd.DataFrame(data)
print("\nDataset before sorting:\n", df)
d_sort1 = df.sort_values(by='Name')
print("\nDataset after sorting by Name:\n", d_sort1)
d_sort2 = df.sort_values(by='Age')
print("\nDataset after sorting by Age:\n", d_sort2)
Output:

(3.5). Manipulation of dataframe
(i) Selection of column:

Source Code:
import pandas as pd
data = {'ONE': pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4]),
        'TWO': pd.Series([50, 60, 70, 80], index=[1, 2, 3, 4])}
df = pd.DataFrame(data)
print(" ---------------------- ")
print(df)
print(" ---------------------- ")
print("Selecting column ONE")
print(df['ONE'])
print(" ---------------------- ")
print("Selecting column TWO")
print(df['TWO'])
print(" ---------------------- ")
Output:

(ii) Addition of column:
import pandas as pd
data = {'ONE': pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4]),
        'TWO': pd.Series([50, 60, 70, 80], index=[1, 2, 3, 4])}
df = pd.DataFrame(data)
print(" ---------------------- ")
print("Data Frame before adding a new column")
print(df)
print(" ---------------------- ")
df['THREE'] = pd.Series([90, 100, 110, 120], index=[1, 2, 3, 4])
print("Data Frame after adding a new column\n", df)
print(" ---------------------- ")
Output:

(iii) Deletion of column

import pandas as pd
data = {'ONE': pd.Series([0, 1, 2, 3], index=[1, 2, 3, 4]),
        'TWO': pd.Series([4, 5, 6, 7], index=[1, 2, 3, 4])}
print(" -------------------- ")
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
print(" ------------------------------------------------------------------- ")
# Any one of the following removes a column:
#   df = df.drop(['ONE'], axis=1)
#   df.drop(df.columns[0], axis=1, inplace=True)     # by position
#   df.drop(df.loc[:, 'name':'mobilenumber'], axis=1, inplace=True)
#       (a range of columns, for a frame that has those columns)
#   df.pop('ONE')
del df['ONE']
print("DataFrame after deleting a column :\n", df)
print(" --------------------- ")
Output:

(iv) Selection of rows

Source Code:
import pandas as pd
data = {'ONE': pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd']),
        'TWO': pd.Series([4, 5, 6, 7], index=['a', 'b', 'c', 'd'])}
print(" --------------------- ")
df = pd.DataFrame(data)
print("DataFrame:\n", df)
print(" --------------------- ")
print("row 'c':")
print(df.loc['c'])
print(" --------------------- ")
Output:

(v) Addition of rows

import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
print(" ------------------- ")
print("df1 :")
print(df1)
print(" ------------------- ")
print("df2 :")
print(df2)
print(" ------------------- ")
print("df1 + df2 :")
# DataFrame.append() was removed in pandas 2.0; pd.concat() does the same job
df1 = pd.concat([df1, df2])
print(df1)

Output:

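(vi) Deletion of rows

import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print("DataFrame:")
print(df)
# Drop the row with index label 0. Rows can also be removed by keeping only
# those that satisfy a condition, e.g. df1 = df[df['age'] > 20] for a frame
# that has an 'age' column.
df = df.drop(0)
print("DataFrame after deleting the row 0 :")
print(df)
Output:
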
RESULT

The various operations on data frames using the pandas module in Python have been implemented
and executed successfully.

Ex.No:04
READING DATA FROM VARIOUS SOURCES

AIM

To Read data from Text file, Excel file and webpage using python functions.

PROCEDURE

• To access data from a CSV file, we use the function read_csv(), which returns the data as a dataframe.

• pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None, usecols=None, engine=None, skiprows=None, nrows=None)

• To access data from an Excel file, we use the function read_excel(), which returns the data as a dataframe.

• To access data from a web page, we use the function read_html(), which returns a list of dataframes, one for each table found on the page.
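
The read_csv() parameters listed above can be tried without any file on disk. In this sketch the CSV text, the column names and the `;` separator are hypothetical stand-ins for a real file such as Data.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file such as Data.csv
csv_text = "id;name;age\n1;aa;20\n2;bb;21\n3;cc;25\n"

df = pd.read_csv(
    io.StringIO(csv_text),          # any file path or file-like buffer
    sep=";",                        # field separator (the default is ',')
    header="infer",                 # take column names from the first line
    index_col="id",                 # use the 'id' column as the row index
    usecols=["id", "name", "age"],  # columns to keep
    nrows=2,                        # read only the first two data rows
)
print(df)
```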

PROGRAM
(4.1) Reading data from a text file

# The with statement closes the file automatically
with open(r'Data.txt') as T:
    print(T.read())

Output:

(4.2) Reading the CSV file

import pandas as pd
data = pd.read_csv(r'Data.csv')
print(data)
Output:

(4.3) Reading the Excel file

Install the required package first (run in a terminal):
pip install openpyxl

Program
import pandas as pd
data = pd.read_excel(r'Data.xlsx')
print(data)
Output:

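(4.4) Reading from the web
Install the required packages first (run in a terminal):
pip install lxml
pip install html5lib
pip install bs4

Program:
import pandas as pd
url = "https://en.wikipedia.org/wiki/Iris_flower_data_set"
# read_html() returns a list with one DataFrame per table on the page
df = pd.read_html(url)
print(df)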
Output:

[     Dataset order  Sepal length  ...  Petal width       Species
0                1           5.1  ...          0.2     I. setosa
1                2           4.9  ...          0.2     I. setosa
2                3           4.7  ...          0.2     I. setosa
3                4           4.6  ...          0.2     I. setosa
4                5           5.0  ...          0.3     I. setosa
..             ...           ...  ...          ...           ...
145            146           6.7  ...          2.3  I. virginica
146            147           6.3  ...          1.9  I. virginica
147            148           6.5  ...          2.0  I. virginica
148            149           6.2  ...          2.3  I. virginica
149            150           5.9  ...          1.8  I. virginica

[150 rows x 6 columns]

RESULT

The operations for reading data from various sources in Python have been
implemented and executed successfully.

Ex.No:05a
IMPLEMENTATION OF FREQUENCY DISTRIBUTIONS

AIM
To write a Python program to implement the concept of frequency distribution.

PROCEDURE
• The frequency (f) of a particular value is the number of times the value occurs in the data.
• The distribution of a variable is the pattern of frequencies, meaning the set of all possible values and the frequencies associated with these values.
• Frequency distributions are portrayed as frequency tables or charts.
• Frequency distributions can show either the actual number of observations falling in each range or the percentage of observations. In the latter instance, the distribution is called a relative frequency distribution.
• Frequency distribution is implemented by the function crosstab().
• Syntax: pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
• Arguments:
• index: array-like, Series, or list of arrays/Series. Values to group by in the rows.
• columns: array-like, Series, or list of arrays/Series. Values to group by in the columns.
• values: array-like, optional. Array of values to aggregate according to the factors. Requires `aggfunc` to be specified.
• rownames: sequence, default None. If passed, must match the number of row arrays passed.
• colnames: sequence, default None. If passed, must match the number of column arrays passed.
• aggfunc: function, optional. If specified, requires `values` to be specified as well.
• margins: bool, default False. Add row/column margins (subtotals).
• margins_name: str, default 'All'. Name of the row/column that will contain the totals when margins is True.
• dropna: bool, default True. Do not include columns whose entries are all NaN.
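
The margins, margins_name and normalize arguments described above can be seen in a short sketch. The Grade and Gender values below are hypothetical and were invented only for this illustration.

```python
import pandas as pd

# Hypothetical survey data, invented for this illustration only
df = pd.DataFrame({
    "Grade":  ["A", "A", "B", "B", "B", "C"],
    "Gender": ["M", "F", "F", "M", "F", "M"],
})

# Frequency table with row and column totals
counts = pd.crosstab(index=df["Grade"], columns=df["Gender"],
                     margins=True, margins_name="Total")

# Relative frequency table: every cell divided by the grand total
relative = pd.crosstab(index=df["Grade"], columns=df["Gender"],
                       normalize="all")
print(counts)
print(relative)
```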

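PROGRAM

import pandas as pd
Data = pd.Series([1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5])
print(Data.value_counts())
print(Data.value_counts(sort=False))
# NOTE: in the source listing the three lists have unequal lengths (12 grades,
# 10 ages with one value missing, 10 genders). The first 10 grades are kept
# here, with the 10 ages that are given, so that the DataFrame can be built.
df = pd.DataFrame({'Grade': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C'],
                   'Age': [18, 18, 18, 19, 19, 20, 18, 18, 19, 19],
                   'Gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'F']})
print(df)
print(pd.crosstab(index=df['Grade'], columns='count'))
print(pd.crosstab(index=df['Age'], columns='count'))
tab = pd.crosstab(index=df['Age'], columns='count')
print(tab / tab.sum())
print(pd.crosstab(index=df['Age'], columns=df['Grade']))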
Output:

RESULT

The implementation of the concept of frequency distribution in Python has been executed successfully.

Ex.No:05b
IMPLEMENTATION OF AVERAGES AND VARIABILITY

AIM
To write a Python program to implement the concepts of average and variability.

PROCEDURE
• The mean() function returns the mean of the values along the requested axis.
• Applied to a Series object, it returns a scalar value, which is the mean of all the observations in the Series.
• Applied to a DataFrame object, it returns a Series object that contains the mean of the values over the specified axis.
• DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
• Parameters
• axis: {index (0), columns (1)}. The axis along which the function is applied.
• skipna: Excludes all null values when computing the result.
• level: If the axis is a MultiIndex (hierarchical), computes along a particular level, collapsing into a Series. (This argument was removed in pandas 2.0.)
• numeric_only: Includes only int, float and Boolean columns. If None, it will attempt to use everything, then use only numeric data. Not implemented for Series.
• It returns a scalar or a Series, or a Series or DataFrame if the level is specified.
• The pandas std() function calculates the standard deviation of a given set of numbers, a DataFrame, a column, or rows.
• pandas computes the standard deviation itself, so no extra import is needed. For plain Python lists, the standard library module "statistics" offers mean(), median() and stdev().
• Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
• It returns a scalar, or a Series or DataFrame if the level is specified.
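
The ddof=1 default in the std() signature above matters in practice: pandas divides by n - 1 (sample standard deviation) while numpy.std divides by n by default. A minimal sketch of the difference, using the same eight values as the first numpy program in this exercise:

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]
s = pd.Series(values)

print(s.mean())        # 5.0
print(np.std(values))  # population standard deviation (ddof=0): 2.0
print(s.std())         # sample standard deviation (ddof=1), slightly larger
print(s.std(ddof=0))   # same as np.std: 2.0
```

This is why the same data can give different standard deviations in the numpy and pandas programs of this exercise.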

PROGRAM
AVERAGE AND VARIABILITY using numpy and List:

# Example 1
import numpy as np
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(np.average(values))  # Calculating average using average()
print(np.var(values))      # Calculating variance using var()
print(np.std(values))      # Calculating standard deviation using std()
OUTPUT
5.0
4.0
2.0

# Example 2
import numpy as np
x = np.arange(5)  # Original array
print(x)
r11 = np.mean(x)
r12 = np.average(x)
print("\nMean:", r11, r12)
r21 = np.std(x)
r22 = np.sqrt(np.mean((x - np.mean(x)) ** 2))
print("\nstd dev:", r21, r22)
r31 = np.var(x)
r32 = np.mean((x - np.mean(x)) ** 2)
print("\nvariance: ", r31, r32)
OUTPUT
[0 1 2 3 4]

Mean: 2.0 2.0

std dev: 1.4142135623730951 1.4142135623730951

variance:  2.0 2.0

AVERAGE AND VARIABILITY using numpy and Dictionary
import numpy as np
dicti = {'a': 20, 'b': 32, 'c': 12, 'd': 93, 'e': 84}  # creating our test dictionary
listr = []
for value in dicti.values():
    listr.append(value)
mean = np.mean(listr)
std = np.std(listr)
print(mean)
print(std)

OUTPUT
48.2
33.63569532505609

AVERAGE AND VARIABILITY using Pandas
# Example 1
import pandas as pd
s = pd.Series(data=[5, 9, 8, 5, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 3])
print(s)
print(s.mean())  # finding the mean
print(s.std())   # finding the standard deviation
OUTPUT
0 5
1 9
2 8
3 5
4 7
5 8
6 1
7 2
8 3
9 4
10 5
11 6
12 7
13 8
14 9
15 5
16 3
dtype: int64
5.588235294117647
2.450990196058824

# Example 2
import pandas as pd
df = pd.DataFrame({'ID': [114, 345, 157788, 5626], 'Discount': [10, 20, 10, 50]})
print(df)
print(df.mean())  # finding the mean
print(df.std())   # finding the standard deviation
OUTPUT
       ID  Discount
0     114        10
1     345        20
2  157788        10
3    5626        50
ID          40968.25
Discount       22.50
dtype: float64
ID          77921.427966
Discount       18.929694
dtype: float64

RESULT

The implementation of the concept of average and variability in python has been executed successfully.

Ex.No:06a
IMPLEMENTATION OF NORMAL CURVES

AIM
To write a Python program to implement the concept of normal distribution.

PROCEDURE
In a normal distribution, data are symmetrically distributed with no skew. Most values cluster
around a central region, with values tapering off as they go further away from the center.
The measures of central tendency (mean, mode, and median) are exactly the same in a normal
distribution.
In Python, we implement the normal distribution using norm().
scipy.stats.norm is a normal continuous random variable.
It inherits a collection of generic methods as an instance of the rv_continuous class
and completes them with details specific to this particular distribution.
The parameters used with norm() are:
q: lower and upper tail probability
x: quantiles
loc: [optional] location parameter (the mean). Default = 0
scale: [optional] scale parameter (the standard deviation). Default = 1
size: [tuple of ints, optional] shape of the random variates.
moments: [optional] composed of letters ['mvsk']; 'm' = mean, 'v' = variance, 's' = Fisher’s skew
and 'k' = Fisher’s kurtosis (default = 'mv').
Calling norm() returns a normal continuous random variable.
PROGRAM#1
import numpy as np
import matplotlib.pyplot as plt

# Create a function for the normal distribution
def normal_dist(x, mean, sd):
    prob_density = (1 / (np.sqrt(2 * np.pi) * sd)) * np.exp(-0.5 * ((x - mean) / sd) ** 2)
    return prob_density

# Generate data points
x = np.linspace(1, 50, 200)

# Calculate mean and standard deviation
mean = np.mean(x)
sd = np.std(x)

# Apply function to the data
pdf = normal_dist(x, mean, sd)

# Plotting the results
plt.plot(x, pdf, color='red')
plt.xlabel('Data points')
plt.ylabel('Probability Density')
plt.title('Normal Distribution')
plt.show()
OUTPUT

PROGRAM#2
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
data = np.arange(1, 10, 0.01)  # Creating the distribution
pdf = norm.pdf(data, loc=5.3, scale=1)
sb.set_style('whitegrid')  # Visualizing the distribution
# Recent seaborn versions require x and y to be passed as keywords
sb.lineplot(x=data, y=pdf, color='black')
plt.xlabel('Heights')
plt.ylabel('Probability Density')
plt.show()
OUTPUT

PROGRAM#3
# import required libraries
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
# Creating the distribution
data = np.arange(1, 10, 0.01)
pdf = norm.pdf(data, loc=5.3, scale=1)
# Probability of height to be under 4.5 ft.
prob_1 = norm(loc=5.3, scale=1).cdf(4.5)
print(prob_1)
# Probability that the height of the person will be between 6.5 and 4.5 ft.
cdf_upper_limit = norm(loc=5.3, scale=1).cdf(6.5)
cdf_lower_limit = norm(loc=5.3, scale=1).cdf(4.5)
prob_2 = cdf_upper_limit - cdf_lower_limit
print(prob_2)
# Probability that the height of a person chosen randomly will be above 6.5 ft.
cdf_value = norm(loc=5.3, scale=1).cdf(6.5)
prob_3 = 1 - cdf_value
print(prob_3)
sb.set_style('whitegrid')
# Recent seaborn versions require x and y to be passed as keywords
pdf1 = sb.lineplot(x=data, y=pdf, color='black')
pdf1.axvline(4.5, label='4.5 ft')
pdf1.axvline(6.5, label='6.5 ft')
plt.xlabel('Heights')
plt.ylabel('Probability Density')
plt.legend()
plt.show()
OUTPUT
0.21185539858339675
0.673074931194895
0.11506967022170822

RESULT

The implementation of the concept of normal distribution in Python has been executed successfully.

Ex.No:06b
IMPLEMENTATION OF CORRELATION

AIM
To write a Python program to implement the concept of correlation.

PROCEDURE
• Correlation refers to a process for establishing the relationship between two variables.
• To get a general idea about whether or not two variables are related, it is helpful to plot them on a scatterplot.
• Methods of correlation summarize the relationship between two variables in a single number called the correlation coefficient. The correlation coefficient is usually represented by the symbol r, and it ranges from -1 to +1.
• A correlation coefficient quite close to 0, whether positive or negative, implies little or no relationship between the two variables.
• A correlation coefficient close to +1 indicates a positive relationship between the two variables, with increases in one of the variables being associated with increases in the other variable.
• A correlation coefficient close to -1 indicates a negative relationship between the two variables, with an increase in one of the variables being associated with a decrease in the other variable.
• A correlation coefficient can be produced for ordinal, interval, or ratio level variables but has little meaning for variables measured on a scale that is no more than nominal.
• For ordinal scales, the correlation coefficient can be calculated using Spearman’s rho.
• For interval or ratio level scales, the most commonly used correlation coefficient is Pearson’s r, ordinarily referred to as simply the correlation coefficient.
• Pearson correlation is implemented by the stats.pearsonr() function.
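
The difference between Pearson’s r and Spearman’s rho mentioned above can be seen with a small sketch. The data are hypothetical: y is the cube of x, so the relationship is perfectly monotonic but not a straight line.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6])
y = x ** 3                       # monotonic but not linear in x

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(pearson_r)     # below 1 because the points do not lie on a straight line
print(spearman_rho)  # 1 (up to rounding) because the rank orders agree perfectly
```

Spearman’s rho works on ranks only, which is why it suits ordinal data.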

PROGRAM

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import pandas as pd
experience = [1, 3, 4, 5, 5, 6, 7, 10, 11, 12, 15, 20, 25, 28, 30, 35]
salary = [20000, 30000, 40000, 45000, 55000, 60000, 80000, 100000, 130000, 150000,
          200000, 230000, 250000, 300000, 350000, 400000]
data = {'expr': [1, 3, 4, 5, 5, 6, 7, 10, 11, 12, 15, 20, 25, 28, 30, 35],
        'sal': [20000, 30000, 40000, 45000, 55000, 60000, 80000, 100000, 130000, 150000,
                200000, 230000, 250000, 300000, 350000, 400000],
        'age': [20, 30, 36, 40, 28, 40, 50, 54, 49, 47, 45, 39, 29, 40, 30, 38]}
df = pd.DataFrame(data)
# Pearson's r between experience and salary
corr = stats.pearsonr(experience, salary)[0]
print("The r value is:", corr)
if corr >= 0.7:
    print("Positively strong relation")
elif corr <= -0.7:
    print("Negatively strong relation")
else:
    print("No relation")
a = np.corrcoef(experience, salary)
plt.subplot(1, 2, 1)
plt.scatter(experience, salary, c="blue")
plt.xlabel("experience")
plt.ylabel("salary")
print("The correlation Matrix is:\n", a)
# Pearson's r between experience and age, computed with pandas
r = df['expr'].corr(df['age'])
print("The r value is:", r)
if r >= 0.7:
    print("Positively strong relation")
elif r <= -0.7:
    print("Negatively strong relation")
else:
    print("No relation")
plt.subplot(1, 2, 2)
plt.scatter(df.expr, df.age, c="red")
plt.xlabel("experience")
plt.ylabel("age")
plt.show()

OUTPUT
The r value is: 0.9929845761480397
Positively strong relation
The correlation Matrix is:
 [[1.         0.99298458]
 [0.99298458 1.        ]]
The r value is: 0.010682427800601435
No relation

RESULT

The implementation of the concept of correlation coefficient in python has been executed successfully.

Ex.No:07
IMPLEMENTATION OF LEAST SQUARE REGRESSION METHOD

AIM
To write a Python program to implement the concept of the least square regression
line method.

PROCEDURE
• The least-squares method is a crucial statistical technique used to find a regression line or a best-fit line for a given pattern. This method is described by an equation with specific parameters.
• The method of least squares is widely used in evaluation and regression. In regression analysis, this method is considered a standard approach for approximating sets of equations that have more equations than unknowns.
• The least-squares method defines the solution for minimizing the sum of squares of deviations or errors in the results of each equation. The formula for the sum of squares of errors helps to find the variation in observed data.
• The least-squares method is often applied in data fitting.
• The best-fit result aims to reduce the sum of squared errors or residuals, which are the differences between the observed or experimental values and the corresponding fitted values given by the model.
• We can derive the line of best fit using the formula:
y = ax + b
where a is the slope and b is the intercept. (PROGRAM#1 labels the intercept a and the slope b.)
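
For the line y = ax + b, minimizing the sum of squared residuals gives the standard closed-form estimates a = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − a·x̄. The sketch below evaluates them on the data used in PROGRAM#1 and compares the result with numpy.polyfit as a cross-check; using polyfit is an addition for illustration and not part of the manual's method.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# Closed-form least-squares estimates for y = a*x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

# np.polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

residuals = y - (a * x + b)
print(a, b)
print(np.sum(residuals ** 2))   # the sum of squared errors being minimized
```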

PROGRAM#1
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # Calculating cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # Calculating regression coefficients (b_0 = intercept, b_1 = slope)
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1] * x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

# observations/data
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\na={}\nb={}".format(b[0], b[1]))
print("Least square regression equation is:", "{0:.3f}".format(b[0]), "+x", "{0:.3f}".format(b[1]))
# plotting regression line
plot_regression_line(x, y, b)
OUTPUT
Estimated coefficients:
a=1.2363636363636363
b=1.1696969696969697
Least square regression equation is: 1.236 +x 1.170

PROGRAM#2
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
plt.scatter(x, y, color="m", marker="o", s=30)
plt.plot(x, y_pred, color="green")
plt.xlabel('x')
plt.ylabel('y')
plt.show()
OUTPUT
coefficient of determination: 0.7158756137479542
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]

RESULT

The implementation of the concept of least square regression in python has been executed
successfully.

Ex.No:08
IMPLEMENTATION OF MULTIPLE REGRESSION

AIM

To write a Python program to implement the concept of the multiple regression method.

PROCEDURE

Create an instance of the class LinearRegression, which will represent the regression model:

model = LinearRegression()

This statement creates the variable model as an instance of LinearRegression. You can provide several optional parameters to LinearRegression:

fit_intercept is a Boolean that, if True, decides to calculate the intercept 𝑏₀ or, if False, considers it equal to zero. It defaults to True.

normalize is a Boolean that, if True, decides to normalize the input variables. It defaults to False, in which case it doesn’t normalize the input variables. (This parameter was deprecated in scikit-learn 1.0 and removed in 1.2.)

copy_X is a Boolean that decides whether to copy (True) or overwrite the input variables (False). It’s True by default.

n_jobs is either an integer or None. It represents the number of jobs used in parallel computation. It defaults to None, which usually means one job. -1 means to use all available processors.

It’s time to start using the model.

First, you need to call .fit() on model:

model.fit(x, y)

With .fit(), you calculate the optimal values of the weights 𝑏₀ and 𝑏₁, using the existing input and output, x and y, as the arguments.

In other words, .fit() fits the model. It returns self, which is the variable model itself.
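
What "calculating the optimal weights" means can be sketched with NumPy alone. This is an illustration of ordinary least squares under stated assumptions, not scikit-learn's implementation: the data are hypothetical and generated exactly from y = 1 + 2·x1 + 3·x2, so the recovered weights are known in advance.

```python
import numpy as np

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2
x = np.array([[0, 1], [1, 0], [2, 2], [3, 1], [4, 5]], dtype=float)
y = 1 + 2 * x[:, 0] + 3 * x[:, 1]

# Prepend a column of ones so that the first weight plays the role of b0
design = np.column_stack([np.ones(len(x)), x])

# Least-squares solution of design @ weights = y
weights, *_ = np.linalg.lstsq(design, y, rcond=None)

b0, b1, b2 = weights
print(b0, b1, b2)   # approximately 1, 2, 3
```

With fit_intercept=True, the intercept_ and coef_ attributes of a fitted LinearRegression model hold the same kind of values as b0 and (b1, b2) here.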

PROGRAM

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x, y = np.array(x), np.array(y)
model = LinearRegression().fit(x, y)
r_sq = model.score(x, y)
print(f"coefficient of determination: {r_sq}")
y_pred = model.predict(x)
# The same prediction computed by hand from the fitted intercept and coefficients
y_pred = model.intercept_ + np.sum(model.coef_ * x, axis=1)
print(f"predicted response:\n{y_pred}")

OUTPUT
coefficient of determination: 0.8615939258756776
predicted response:
[ 5.77760476  8.012953   12.73867497 17.9744479  23.97529728 29.4660957
 38.78227633 41.27265006]

RESULT

The implementation of the concept of multiple regression in Python has been executed
successfully.

Ex.No:09
IMPLEMENTATION OF Z TEST

AIM
To write a Python program to implement the concept of the Z-test.

PROCEDURE

• A Z-test is a statistical test used to determine whether two population means are different when the variances are known and the sample size is large.
• The test statistic is assumed to have a normal distribution, and nuisance parameters such as the standard deviation should be known in order for an accurate Z-test to be performed.
• You can use the ztest() function from the statsmodels package to perform one-sample and two-sample Z-tests in Python.
statsmodels.stats.weightstats.ztest(x1, x2=None, value=0)
where:
x1: values for the first sample
x2: values for the second sample (if performing a two-sample Z-test)
value: mean under the null (in the one-sample case) or mean difference (in the two-sample case)
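
The one-sample statistic described above can be sketched with the standard library alone: the sample mean is compared with the null mean, scaled by the standard error, and the upper-tail area of the standard normal distribution gives the p-value. The function name one_sample_z and the five sample values are hypothetical, and using the sample standard deviation in place of a known population value is an assumption of this sketch; a real Z-test would also need a much larger sample.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_z(sample, null_mean):
    """Return (z, p) for the alternative 'population mean > null_mean'."""
    n = len(sample)
    # Standard error of the mean, using the sample standard deviation
    standard_error = stdev(sample) / math.sqrt(n)
    z = (mean(sample) - null_mean) / standard_error
    # Upper-tail area of the standard normal distribution
    p = 1 - NormalDist().cdf(z)
    return z, p


scores = [110, 112, 108, 111, 109]  # illustrative values only
z, p = one_sample_z(scores, null_mean=100)
print(z, p)
if p < 0.05:
    print("Reject Null Hypothesis")
else:
    print("Retain Null Hypothesis")
```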

PROGRAM
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Generate a random array of 50 numbers with mean 110 and standard deviation
# 15/sqrt(50), similar to the IQ scores data we assume above
mean_iq = 110
sd_iq = 15 / math.sqrt(50)
alpha = 0.05
null_mean = 100
print(randn(50))
data = sd_iq * randn(50) + mean_iq
print(data)
# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))
# Now we perform the test. In this function we pass the data; in the value parameter
# we pass the mean value of the null hypothesis; in the alternative hypothesis we
# check whether the mean is larger
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')
# The function outputs a z-score and the p-value corresponding to it. We compare the
# p-value with alpha: if it is greater than alpha we do not reject the null hypothesis,
# else we reject it.
print(ztest_Score)
print(p_value)
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Retain NULL Hypothesis")
OUTPUT
[109.32317127 107.5180639  110.34687099 109.24347939 110.53062543
 107.52136918 108.82945441 111.76442439 109.52318027 110.82466106
 115.26944919 107.92161923 107.72658274 110.41249653 108.29400528
 109.14010895 108.72487358 109.66275854 111.32339348 109.04384837
 109.86205757 107.5119657  110.81380703 112.01224839 110.78607939
 109.92631499 111.87247694 111.44013795 107.80153879 110.97192483
 112.59626892 105.76061577 106.49222008 112.77586897 108.94970948
 107.82422582 105.03208528 106.02078487 109.01139522 110.93278094
 108.19795152 107.90859201 107.45543581 106.06955274 110.7309573
 110.0331299  107.9865958  109.69378913 111.6672574  106.72201292]
mean=109.36 stdv=2.04
32.152747672088914
4.0423380710270466e-227
Reject Null Hypothesis

RESULT

The implementation of the concept of z-test in python has been executed successfully.

Ex.No:10
IMPLEMENTATION OF T TEST

AIM

To write a Python program to implement the various types of t-test.

PROCEDURE

A t-test is a statistical test that is used to compare the means of two groups.

It is often used in hypothesis testing to determine whether a process or treatment actually has an effect on the population of interest, or whether two groups are different from one another.

Implement the t-test using the following functions:

(1) scipy.stats.ttest_1samp(a, popmean, axis=0, nan_policy='propagate', alternative='two-sided', *, keepdims=False)
Calculate the t-test for the mean of ONE group of scores.

This is a test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean, popmean.

Returns: The t-statistic, p-value, df value, CI value

(2) scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate', permutations=None, random_state=None, alternative='two-sided', trim=0)
Calculate the t-test for the means of two independent samples of scores.

This is a test for the null hypothesis that two independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

Returns: The t-statistic, p-value

(3) scipy.stats.ttest_rel(a, b, axis=0, nan_policy='propagate', alternative='two-sided', *, keepdims=False)
Calculate the t-test on TWO RELATED samples of scores, a and b.

This is a test for the null hypothesis that two related or repeated samples have identical average (expected) values.

Returns: The t-statistic, p-value, df value, CI value
PROGRAM#1

One sample t-test
Data:
Systolic blood pressures of 14 patients are given below:
183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138
Test whether the population mean is less than 165.
Hypothesis
H0: There is no significant mean difference in systolic blood pressure, i.e., μ = 165
H1: The population mean is less than 165, i.e., μ < 165
Code:
from scipy import stats

sys_bp = [183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]
mu = 165
t_value, p_value = stats.ttest_1samp(sys_bp, mu)
# ttest_1samp returns a two-sided p-value; halve it for the one-tailed test
# (appropriate when the test statistic falls on the side of H1)
one_tailed_p_value = float("{:.6f}".format(p_value / 2))
print('Test statistic is %f' % float("{:.6f}".format(t_value)))
print('p-value for one tailed test is %f' % one_tailed_p_value)
alpha = 0.05
if one_tailed_p_value <= alpha:
    print('Conclusion', '\n', 'Since p-value(=%f)' % one_tailed_p_value, '<', 'alpha(=%.2f)' % alpha,
          ('We reject the null hypothesis H0. So we conclude that the population mean '
           'is less than 165. i.e., μ < 165 at %.2f level of significance.') % alpha)
else:
    print('Conclusion', 'Since p-value(=%f)' % one_tailed_p_value, '>', 'alpha(=%.2f)' % alpha,
          'We do not reject the null hypothesis H0.')
OUTPUT
Test statistic is -1.243183
p-value for one tailed test is 0.117877
Conclusion Since p-value(=0.117877) > alpha(=0.05) We do not reject the null hypothesis H0.
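
The test statistic reported above can be checked by hand. The following sketch, which uses only the standard library, computes t = (sample mean - μ) / (s / √n) for the same blood pressure data; the variable names standard_error, t_stat and dof are chosen for this illustration only.

```python
import math
from statistics import mean, stdev

sys_bp = [183, 152, 178, 157, 194, 163, 144, 114, 178, 152, 118, 158, 172, 138]
mu = 165

n = len(sys_bp)
# Standard error of the mean: sample standard deviation divided by sqrt(n)
standard_error = stdev(sys_bp) / math.sqrt(n)
t_stat = (mean(sys_bp) - mu) / standard_error
dof = n - 1  # degrees of freedom of the one-sample t-test
print(f"t = {t_stat:.6f} with {dof} degrees of freedom")
```

The printed statistic should agree with the value returned by ttest_1samp; the p-value additionally requires the t distribution with n - 1 degrees of freedom, which is what SciPy supplies.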

PROGRAM#2

Two independent sample t-test
Data:
To compare the effectiveness of ammonium chloride and urea on the grain yield of paddy, an experiment was conducted. The results are given below:
Ammonium chloride (X1): 13.4 10.9 11.2 11.8 14 15.3 14.2 12.6 17 16.2 16.5 15.7
Urea (X2): 12 11.7 10.7 11.2 14.8 14.4 13.9 13.7 16.9 16 15.6 16
Hypothesis
H0: The effects of ammonium chloride and urea on the grain yield of paddy are equal, i.e., μ1 = μ2
H1: The effects of ammonium chloride and urea on the grain yield of paddy are not equal, i.e., μ1 ≠ μ2
Code
from scipy import stats

Ammonium_chloride = [13.4, 10.9, 11.2, 11.8, 14, 15.3, 14.2, 12.6, 17, 16.2, 16.5, 15.7]
Urea = [12, 11.7, 10.7, 11.2, 14.8, 14.4, 13.9, 13.7, 16.9, 16, 15.6, 16]
# Two-sided test for equal means, assuming equal variances (the default)
t_value, p_value = stats.ttest_ind(Ammonium_chloride, Urea)
print('Test statistic is %f' % float("{:.6f}".format(t_value)))
print('p-value for two tailed test is %f' % p_value)
alpha = 0.05
if p_value <= alpha:
    print('Conclusion', '\n', 'Since p-value(=%f)' % p_value, '<', 'alpha(=%.2f)' % alpha,
          ('We reject the null hypothesis H0. So we conclude that the effects of ammonium chloride '
           'and urea on grain yield of paddy are not equal i.e., μ1 ≠ μ2 at %.2f level of significance.') % alpha)
else:
    print('Conclusion', '\n', 'Since p-value(=%f)' % p_value, '>', 'alpha(=%.2f)' % alpha,
          'We do not reject the null hypothesis H0.')
OUTPUT
Test statistic is 0.184650
p-value for two tailed test is 0.855195
Conclusion
 Since p-value(=0.855195) > alpha(=0.05) We do not reject the null hypothesis H0.
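
As a cross-check on the equal-variance test used above, the pooled-variance t statistic can be computed directly. This is a minimal sketch using only the standard library; the helper name pooled_t is an assumption of this example and is not a SciPy function.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """Return (t, degrees of freedom) for the equal-variance two-sample t-test."""
    n1, n2 = len(a), len(b)
    # Pooled estimate of the common variance
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / standard_error, n1 + n2 - 2


ammonium_chloride = [13.4, 10.9, 11.2, 11.8, 14, 15.3, 14.2, 12.6, 17, 16.2, 16.5, 15.7]
urea = [12, 11.7, 10.7, 11.2, 14.8, 14.4, 13.9, 13.7, 16.9, 16, 15.6, 16]
t_stat, dof = pooled_t(ammonium_chloride, urea)
print(f"t = {t_stat:.6f} with {dof} degrees of freedom")
```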

PROGRAM#3

Two related sample t-test:
Eleven schoolboys were given a test in Statistics. They were given a month’s tuition and a second test was held at the end of it. Do the marks give evidence that the students have benefited from the exam coaching?
Marks in 1st test: 23 20 19 21 18 20 18 17 23 16 19
Marks in 2nd test: 24 19 22 18 20 22 20 20 23 20 18
Hypothesis
H0: The students have not benefited from the tuition class, i.e., d = 0
H1: The students have benefited from the tuition class, i.e., d < 0
where d = x - y; d is the difference between marks in the first test (say x) and marks in the second test (say y).
Code
from scipy import stats

alpha = 0.05
first_test = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
second_test = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 18]
t_value, p_value = stats.ttest_rel(first_test, second_test)
# ttest_rel returns a two-sided p-value; halve it for the one-tailed test
one_tailed_p_value = float("{:.6f}".format(p_value / 2))
print('Test statistic is %f' % float("{:.6f}".format(t_value)))
print('p-value for one_tailed_test is %f' % one_tailed_p_value)
if one_tailed_p_value <= alpha:
    print('Conclusion', '\n', 'Since p-value(=%f)' % one_tailed_p_value, '<', 'alpha(=%.2f)' % alpha,
          ('We reject the null hypothesis H0. So we conclude that the students have benefited '
           'by the tuition class. i.e., d < 0 at %.2f level of significance.') % alpha)
else:
    print('Conclusion', '\n', 'Since p-value(=%f)' % one_tailed_p_value, '>', 'alpha(=%.2f)' % alpha,
          ('We do not reject the null hypothesis H0. So we conclude that the students have not '
           'benefited by the tuition class. i.e., d = 0 at %.2f level of significance.') % alpha)
OUTPUT
Test statistic is -1.707331
p-value for one_tailed_test is 0.059282
Conclusion
 Since p-value(=0.059282) > alpha(=0.05) We do not reject the null hypothesis H0. So we conclude that the students have not benefited by the tuition class. i.e., d = 0 at 0.05 level of significance.
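
A related-samples t-test is equivalent to a one-sample t-test on the pairwise differences d = x - y against a mean of zero. The sketch below illustrates this with the standard library only; the names differences and t_stat are chosen for this example.

```python
import math
from statistics import mean, stdev

first_test = [23, 20, 19, 21, 18, 20, 18, 17, 23, 16, 19]
second_test = [24, 19, 22, 18, 20, 22, 20, 20, 23, 20, 18]

# Pairwise differences d = x - y
differences = [x - y for x, y in zip(first_test, second_test)]
n = len(differences)
# One-sample t statistic of the differences against zero
t_stat = mean(differences) / (stdev(differences) / math.sqrt(n))
print(f"mean difference = {mean(differences):.4f}, t = {t_stat:.6f}")
```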

RESULT

The implementation of the concept of various type of t test in python has been executed successfully.

Ex.No:11
IMPLEMENTATION OF ANOVA

AIM
To write a Python program to implement the concept of ANOVA and the F-test.

PROCEDURE
• An ANOVA test is a statistical test used to determine if there is a statistically significant difference between two or more categorical groups by testing for differences of means using variance.
• Another key part of ANOVA is that it splits the independent variable into two or more groups.
• It is implemented by f_oneway().
scipy.stats.f_oneway(*samples, axis=0)
Perform one-way ANOVA.

The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean. The test is applied to samples from two or more groups, possibly with differing sizes.
Parameters:
sample1, sample2, … : array_like
The sample measurements for each group. There must be at least two arguments. If the arrays are multidimensional, then all the dimensions of the array must be the same except for axis.
axis : int, optional
Axis of the input arrays along which the test is applied. Default is 0.
Returns: F statistic value and p-value

PROGRAM#1
# Importing library
from scipy.stats import f_oneway

# Performance when each of the engine
# oils is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]
# Conduct the one-way ANOVA
f, p_value = f_oneway(performance1, performance2, performance3, performance4)
print(f, p_value)
OUTPUT
4.625000000000002 0.016336459839780215
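
The F statistic printed above is the ratio of the between-group mean square to the within-group mean square. The following standard-library sketch reproduces it for the same four groups; the function name one_way_f is an assumption of this example.

```python
from statistics import mean


def one_way_f(*groups):
    """Return the one-way ANOVA F statistic for the given groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]
f_stat = one_way_f(performance1, performance2, performance3, performance4)
print(round(f_stat, 6))  # 4.625
```

Here the between-group sum of squares is 244.2 on 3 degrees of freedom and the within-group sum of squares is 281.6 on 16 degrees of freedom, giving 81.4 / 17.6 = 4.625.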

PROGRAM#2
import random
import numpy as np
import pandas as pd
import patsy
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
from statsmodels.regression.mixed_linear_model import MixedLMResults
from scipy import stats
import seaborn as sns

subs_list = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10']
task_list = ['task1', 'task2', 'task3']
condition_list = ['pre', 'post']
# read data into dataframe
df_2way_rm = pd.DataFrame(columns=["sub_id", "task", "condition", "my_value"])
my_row = 0
# unique subject-ID as additional factor
sub_id = 0
for sub in subs_list:
    sub_id = sub_id + 1
    for ind_t, task in enumerate(task_list):
        for ind_c, con in enumerate(condition_list):
            # generate random value here as example
            my_val = np.random.normal(ind_t + ind_c, 1, 1)[0]
            df_2way_rm.loc[my_row] = [sub_id, task, con, my_val]
            my_row = my_row + 1
# make sure the dependent variable is numeric before fitting
df_2way_rm['my_value'] = df_2way_rm['my_value'].astype(float)
print(df_2way_rm)
# conduct the two-way repeated-measures ANOVA using AnovaRM
my_model_fit = AnovaRM(df_2way_rm, 'my_value', 'sub_id', within=['task', 'condition']).fit()
print(my_model_fit.anova_table)
OUTPUT
                  F Value  Num DF  Den DF    Pr > F
task            24.844085     2.0    18.0  0.000007
condition        7.293545     1.0     9.0  0.024367
task:condition   0.774146     2.0    18.0  0.4758

RESULT

The implementation of the concept of ANOVA in python has been executed successfully.

Ex.No:12
IMPLEMENTATION OF LOGISTIC REGRESSION

AIM

To use a logistic regression model to predict whether a person would buy life insurance based on their age.

PROCEDURE

• Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique.
• It is used for predicting a categorical dependent variable from a given set of independent variables.
• Because logistic regression predicts the output of a categorical dependent variable, the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, True or False, etc. Instead of giving the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1.
• Logistic regression is very similar to linear regression except for how the two are used.
• Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function whose output lies between the two limiting values 0 and 1.
• Logistic regression can be used to classify observations using different types of data and can easily determine the most effective variables used for classification.

The above equation is the final equation for LogisticRegression.
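
A common way to write the logistic regression equation is the log-odds form log(p / (1 - p)) = b0 + b1·x, which is equivalent to p = 1 / (1 + e^-(b0 + b1·x)). The symbols b0, b1 and p are assumed here, since they are not defined in this section. The sketch below evaluates that "S"-shaped function for a few ages; the coefficient values and the function name predict_probability are hypothetical and are not fitted to the insurance data.

```python
import math


def predict_probability(age, b0, b1):
    """Logistic (sigmoid) function applied to the linear score b0 + b1 * age."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))


# Hypothetical coefficients chosen only for illustration
b0, b1 = -4.0, 0.1
for age in (20, 40, 60):
    p = predict_probability(age, b0, b1)
    label = 1 if p >= 0.5 else 0  # classify with a 0.5 threshold
    print(age, round(p, 3), label)
```

With these assumed coefficients the probability crosses 0.5 at age 40, so younger ages are classified as 0 and older ages as 1.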

PROGRAM

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the data: age and whether the person bought insurance
df = pd.read_csv("insurance_data.csv")
print(df.head())
plt.scatter(df.age, df.bought_insurance, marker='+', color='red')
plt.show()

# Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(df[['age']], df.bought_insurance, train_size=0.8)

model = LogisticRegression()
model.fit(X_train, y_train)
print(X_test)
y_predicted = model.predict(X_test)
print(model.predict_proba(X_test))
# model.predict([[20]])
# calculating accuracy
print(model.score(X_test, y_test))
print(y_predicted)
print(model.coef_)
print(model.intercept_)

RESULT

Thus the logistic regression model for predicting whether a person would buy life insurance based on their age was implemented successfully.

Ex.No:13
IMPLEMENTATION OF TIME SERIES ANALYSIS

AIM
To perform Time Series Analysis of Open Power System Data.

PROCEDURE
• A time series is a series of data points in which each data point is associated with a timestamp.
• A simple example is the price of a stock in the stock market at different points of time on a given day.
• Another example is the amount of rainfall in a region in different months of the year.
• A time series is a set of observations that are collected at regular intervals of time. If plotted, the time series would always have one of its axes as time.
• Time Series Analysis in Python considers that data collected over time might have some structure; hence it analyzes time series data to extract its valuable characteristics.
• The time variable/feature is the independent variable and supports the target variable to predict the results.
• Time Series Analysis (TSA) is used in different fields for time-based predictions, such as:
  • Weather forecasting models
  • Stock market predictions
  • Signal processing
  • Engineering domain: Control Systems and Communications Systems
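
One simple way to expose structure in regularly spaced observations is to average them over a coarser interval. The sketch below groups fourteen synthetic daily values by ISO calendar week (Monday to Sunday) and averages each week, using only the standard library. The dates and values are made up for illustration and are not taken from the Open Power System Data set.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Fourteen synthetic daily observations starting on a Monday
start = date(2016, 1, 4)
daily = [(start + timedelta(days=i), float(i + 1)) for i in range(14)]

# Group the values by (ISO year, ISO week number)
weekly = defaultdict(list)
for day, value in daily:
    iso_year, iso_week, _ = day.isocalendar()
    weekly[(iso_year, iso_week)].append(value)

# Average each week's observations
weekly_mean = {week: mean(values) for week, values in weekly.items()}
print(weekly_mean)  # {(2016, 1): 4.0, (2016, 2): 11.0}
```

The weekly averages vary more smoothly than the daily values, which is why this kind of down-sampling is often applied before plotting a long series.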

PROGRAM
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns

# load time series dataset
df_power = pd.read_csv("https://raw.githubusercontent.com/jenfly/opsd/master/opsd_germany_daily.csv")
print(df_power.columns)
print(df_power.shape)
print(df_power.tail(10))
print(df_power.dtypes)

# convert object to datetime format
df_power['Date'] = pd.to_datetime(df_power['Date'])
print(df_power.dtypes)
df_power = df_power.set_index('Date')
print(df_power.tail(3))
print(df_power.index)
# Add columns with year, month, and weekday name
df_power['Year'] = df_power.index.year
df_power['Month'] = df_power.index.month_name()
df_power['Weekday_Name'] = df_power.index.day_name()

# Display a random sampling of 5 rows
print(df_power.sample(5, random_state=0))

# Select the data for a single day
print(df_power.loc['2015-10-02'])

# Let's generate a line plot of the full time series of Germany's daily electricity consumption
sns.set(rc={'figure.figsize': (11, 4)})
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 100
df_power['Consumption'].plot(linewidth=0.5)
plt.show()

# Let's use dots to plot the data for all the other columns
cols_to_plot = ['Consumption', 'Solar', 'Wind']
axes = df_power[cols_to_plot].plot(marker='.', alpha=0.5, linestyle='None', figsize=(14, 6), subplots=True)
for ax in axes:
    ax.set_ylabel('Daily Totals (GWh)')
plt.show()

# We can further investigate a single year to have a closer look
ax = df_power.loc['2016', 'Consumption'].plot()
ax.set_ylabel('Daily Consumption (GWh)')
plt.show()
# Let's examine the month of December 2016
ax = df_power.loc['2016-12', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)')
plt.show()

# To show power consumption in a particular week of December, we can supply a specific date range
ax = df_power.loc['2016-12-23':'2016-12-30', 'Consumption'].plot(marker='o', linestyle='-')
ax.set_ylabel('Daily Consumption (GWh)')
plt.show()

# We can first group the data by months and then use box plots to visualize the data:
fig, axes = plt.subplots(3, 1, figsize=(8, 7), sharex=True)
for name, ax in zip(['Consumption', 'Solar', 'Wind'], axes):
    sns.boxplot(data=df_power, x='Month', y=name, ax=ax)
    ax.set_ylabel('GWh')
    ax.set_title(name)
    if ax != axes[-1]:
        ax.set_xlabel('')
plt.show()

# We can group the consumption of electricity by the day of the week, and present it in a box plot:
sns.boxplot(data=df_power, x='Weekday_Name', y='Consumption')
plt.show()

# We can use the code given here to resample our data to weekly means:
columns = ['Consumption', 'Wind', 'Solar', 'Wind+Solar']
power_weekly_mean = df_power[columns].resample('W').mean()
print(power_weekly_mean)

# First six months of 2016
start, end = '2016-01', '2016-06'
fig, ax = plt.subplots()
ax.plot(df_power.loc[start:end, 'Solar'], marker='.', linestyle='-', linewidth=0.5, label='Daily')
ax.plot(power_weekly_mean.loc[start:end, 'Solar'], marker='o', markersize=8, linestyle='-', label='Weekly Mean Resample')
ax.set_ylabel('Solar Production in (GWh)')
ax.legend()
plt.show()

OUTPUT
Index(['Date', 'Consumption', 'Wind', 'Solar', 'Wind+Solar'], dtype='object')
(4383, 5)
            Date  Consumption     Wind   Solar  Wind+Solar
4373  2017-12-22   1423.23782  228.773  10.065     238.838
4374  2017-12-23   1272.17085  748.074   8.450     756.524
4375  2017-12-24   1141.75730  812.422   9.949     822.371
4376  2017-12-25   1111.28338  587.810  15.765     603.575
4377  2017-12-26   1130.11683  717.453  30.923     748.376
4378  2017-12-27   1263.94091  394.507  16.530     411.037
4379  2017-12-28   1299.86398  506.424  14.162     520.586
4380  2017-12-29   1295.08753  584.277  29.854     614.131
4381  2017-12-30   1215.44897  721.247   7.467     728.714
4382  2017-12-31   1107.11488  721.176  19.980     741.156
Date            object
Consumption    float64
Wind           float64
Solar          float64
Wind+Solar     float64
dtype: object
Date           datetime64[ns]
Consumption           float64
Wind                  float64
Solar                 float64
Wind+Solar            float64
dtype: object
            Consumption     Wind   Solar  Wind+Solar
Date
2017-12-29   1295.08753  584.277  29.854     614.131
2017-12-30   1215.44897  721.247   7.467     728.714
2017-12-31   1107.11488  721.176  19.980     741.156
DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03', '2006-01-04',
               '2006-01-05', '2006-01-06', '2006-01-07', '2006-01-08',
               '2006-01-09', '2006-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', name='Date', length=4383, freq=None)
            Consumption    Wind    Solar  ...  Year    Month  Weekday_Name
Date                                      ...
2008-08-23     1152.011     NaN      NaN  ...  2008   August      Saturday
2013-08-08     1291.984  79.666   93.371  ...  2013   August      Thursday
2009-08-27     1281.057     NaN      NaN  ...  2009   August      Thursday
2015-10-02     1391.050  81.229  160.641  ...  2015  October        Friday
2009-06-02     1201.522     NaN      NaN  ...  2009     June       Tuesday

[5 rows x 7 columns]
Consumption     1391.05
Wind             81.229
Solar           160.641
Wind+Solar       241.87
Year               2015
Month           October
Weekday_Name     Friday
Name: 2015-10-02 00:00:00, dtype: object

RESULT

The implementation of the concept of Time Series Analysis in Python has been executed successfully.
