
Research in Computing

Practical No: 1
A) Write a program for obtaining descriptive statistics of data.

code:

import pandas as pd
#Create a Dictionary of series
d={'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df=pd.DataFrame(d)
print(df)
print("----------Sum-----------")
print(df.sum())
print("----------Mean----------")
print(df.mean())
print("---------------Standard Deviation-------")
print(df.std())
print("---------------Descriptive Statistics------------")
print(df.describe())

Output:
B) Write a program to import data from MySQL.
code:

import mysql.connector
conn = mysql.connector.connect(host='localhost', database='information_schema', user='root', password='root')
if conn.is_connected():
    print('###### Connection With MySQL Established Successfully ##### ')
else:
    print('Not Connected -- Check Connection Properties')

mycursor = conn.cursor()
mycursor.execute("show tables;")
myresult = mycursor.fetchall()
for x in myresult:
    print(x)
Output:
Microsoft Excel
################## Retrieve-Country-Currency.py
Code:
import os
import pandas as pd
################################################################
Base = 'C:/VKHCG'
################################################################
sFileDir = Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
CurrencyRawData = pd.read_excel('C:/VKHCG/01-Vermeulen/00-RawData/Country_Currency.xlsx')
sColumns = ['Country or territory', 'Currency', 'ISO-4217']
CurrencyData = CurrencyRawData[sColumns].copy()
CurrencyData.rename(columns={'Country or territory': 'Country', 'ISO-4217': 'CurrencyCode'}, inplace=True)
CurrencyData.dropna(subset=['Currency'], inplace=True)
CurrencyData['Country'] = CurrencyData['Country'].map(lambda x: x.strip())
CurrencyData['Currency'] = CurrencyData['Currency'].map(lambda x: x.strip())
CurrencyData['CurrencyCode'] = CurrencyData['CurrencyCode'].map(lambda x: x.strip())
print(CurrencyData)
print('~~~~~~ Data from Excel Sheet Retrieved Successfully ~~~~~~~ ')
################################################################
sFileName = sFileDir + '/Retrieve-Country-Currency.csv'
CurrencyData.to_csv(sFileName, index=False)
Output:
Practical No: 2

A popular electronics store wants to conduct a survey to establish baseline estimates of branded-laptop
awareness and to determine the popularity of different companies' laptops. The survey suggests steps to be
initiated or strengthened to address demand in a region. The key indicators are awareness among the
general population, demand for branded laptops, and problem users.
The objectives of this particular study are:
1. To know the preferences for different types of branded laptops among students and professionals.
2. To study which factors influence the choice of different types of branded laptops.
3. To know the level of satisfaction towards different types of branded laptops.
4. To identify the perception of consumers towards the laptop positioning strategy.
5. To know consumer preferences towards laptops in the present era.
Use the collected data for analysis.
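As a starting point, the collected responses can be summarized in Python. The following is a minimal sketch with hypothetical data; the column names and values are assumptions, and in practice the responses would be loaded from the survey file:

import pandas as pd
# Hypothetical survey responses (replace with the collected data)
survey = pd.DataFrame({
    'Occupation': ['Student', 'Professional', 'Student', 'Student', 'Professional'],
    'Brand': ['Dell', 'HP', 'Lenovo', 'Dell', 'Apple'],
    'Satisfaction': [4, 3, 5, 4, 5]  # rating from 1 (low) to 5 (high)
})
# Brand preference by occupation (objective 1)
print(pd.crosstab(survey['Occupation'], survey['Brand']))
# Average satisfaction per brand (objective 3)
print(survey.groupby('Brand')['Satisfaction'].mean())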

B. Perform analysis of given secondary data.


Steps in Secondary Data Analysis
1. Determine your research question – Knowing exactly what you are looking for.
2. Locating data – Knowing what is out there and whether you can gain access to it. A quick Internet
search, possibly with the help of a librarian, will reveal a wealth of options.
3. Evaluating relevance of the data – Considering things like the data’s original purpose, when it was
collected, population, sampling strategy/sample, data collection protocols, operationalization of concepts,
questions asked, and form/shape of the data.
4. Assessing credibility of the data – Establishing the credentials of the original researchers, searching
for full explication of methods including any problems encountered, determining how consistent the data
is with data from other sources, and discovering whether the data has been used in any credible published
research.
5. Analysis – This will generally involve a range of statistical processes.
Example: Analyze the given Population Census Data for Planning and Decision
Making by using the size and composition of populations.
Practical No: 3
A) Perform testing of hypothesis using one sample t-test.
Program Code:
from scipy.stats import ttest_1samp
import numpy as np
ages = np.genfromtxt('ages.csv')
print(ages)
ages_mean = np.mean(ages)
print(ages_mean)
tstat, pval = ttest_1samp(ages, 30)
print('p-value:', pval)
if pval < 0.05:  # alpha value is 0.05
    print("we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")

Output:
B) Write a program for t-test comparing two means for independent samples.
Program code:
import numpy as np
from scipy import stats
from numpy.random import randn
N = 100
#a=[35,40,12,15,21,14,46,10,28,48,16,30,32,48,31,22,12,39,19,25]
#b=[2,27,31,38,1,19,1,34,3,1,2,1,3,1,2,1,3,29,37,2]
a = 5*randn(N) + 50
b = 5*randn(N) + 51
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)
s = np.sqrt((var_a + var_b)/2)  # pooled standard deviation (equal sample sizes)
t = (a.mean() - b.mean())/(s*np.sqrt(2/N))
df = 2*N - 2
p = 2*(1 - stats.t.cdf(abs(t), df=df))  # two-tailed p-value
print("t=" + str(t))
print("p=" + str(p))
if p < 0.05:
    print('means of the two distributions are different and significant')
else:
    print('means of the two distributions are the same and not significant')

Output:
C) Perform testing of hypothesis using paired t-test.
The paired sample t-test is also called the dependent sample t-test. It is a univariate test that tests for a
significant difference between two related variables. An example of this is if you were to collect the blood
pressure of an individual before and after some treatment, condition, or time point. The data set
contains blood pressure readings before and after an intervention. These are the variables "bp_before" and
"bp_after". The hypotheses being tested are:
• H0 - The mean difference between sample 1 and sample 2 is equal to 0.
• HA - The mean difference between sample 1 and sample 2 is not equal to 0.

Program Code:
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("C:/Users/User-04/Documents/Dimple_Baroliya/RIC/prac3/blood_pressure.csv")
print(df[['bp_before','bp_after']].describe())
# First let's check for any significant outliers in each of the variables.
df[['bp_before', 'bp_after']].plot(kind='box')
# This saves the plot as a png file
plt.savefig('boxplot_outliers.png')
# Make a histogram of the differences between the two scores.
df['bp_difference'] = df['bp_before'] - df['bp_after']
df['bp_difference'].plot(kind='hist', title='Blood Pressure Difference Histogram')
# Again, this saves the plot as a png file
plt.savefig('blood pressure difference histogram.png')
stats.probplot(df['bp_difference'], plot=plt)
plt.title('Blood pressure Difference Q-Q Plot')
plt.savefig('blood pressure difference qq plot.png')
print(stats.shapiro(df['bp_difference']))  # normality check on the differences
print(stats.ttest_rel(df['bp_before'], df['bp_after']))  # paired t-test

Output:

Practical No: 4
A) Perform testing of hypothesis using chi-squared goodness-of-fit test.

A system administrator needs to upgrade the computers for his division. He wants to know what sort
of computer system his workers prefer. He gives three choices: Windows, Mac, or Linux. Test the
hypothesis or theory that an equal percentage of the population prefers each type of computer
system.
H0 : The population distribution of the variable is the same as the proposed distribution
HA : The distributions are different
To calculate the Chi-Squared value for Windows, go to cell D2 and type =((B2-C2)*(B2-C2))/C2
To calculate the Chi-Squared value for Mac, go to cell D3 and type =((B3-C3)*(B3-C3))/C3
To calculate the Chi-Squared value for Linux, go to cell D4 and type =((B4-C4)*(B4-C4))/C4
Go to cell D5 and type =SUM(D2:D4)
To get the table value for Chi-Square for α = 0.05 and dof = 2, go to cell D7 and type
=CHIINV(0.05,2) At cell D8 type =IF(D5>D7,"H0 Rejected","H0 Accepted")
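The same goodness-of-fit test can also be run in Python. The sketch below is illustrative only; the observed counts are hypothetical:

from scipy.stats import chisquare
# Hypothetical observed preference counts for Windows, Mac and Linux
observed = [45, 30, 25]
expected = [100/3, 100/3, 100/3]  # equal preference under H0
stat, pval = chisquare(f_obs=observed, f_exp=expected)
print('Chi-Squared statistic:', stat)
print('p-value:', pval)
if pval < 0.05:
    print('H0 is Rejected')
else:
    print('H0 is Accepted')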

Output:

B) Perform testing of hypothesis using chi-squared test of independence.


In a study to understand the performance of the M. Sc. IT Part-1 class, a college selects a random sample of
100 students. Each student was asked for the grade obtained in B. Sc. IT. The sample is as given below.

Output:
Null Hypothesis - H0 : The performance of girl students is the same as that of boy students.
Alternate Hypothesis - H1 : The performance of boy and girl students is different.
Open Excel Workbook
Prepare a contingency table as shown above.
To calculate Girls Students with 'O' Grade, go to cell N6 and type =COUNTIF($J$2:$K$40,"O")
To calculate Girls Students with 'A' Grade, go to cell O6 and type =COUNTIF($J$2:$K$40,"A")
To calculate Girls Students with 'B' Grade, go to cell P6 and type =COUNTIF($J$2:$K$40,"B")
To calculate Girls Students with 'C' Grade, go to cell Q6 and type =COUNTIF($J$2:$K$40,"C")
To calculate Girls Students with 'D' Grade, go to cell R6 and type =COUNTIF($J$2:$K$40,"D")
To calculate Boys Students with 'O' Grade, go to cell N7 and type =COUNTIF($D$2:$E$62,"O")
To calculate Boys Students with 'A' Grade, go to cell O7 and type =COUNTIF($D$2:$E$62,"A")
To calculate Boys Students with 'B' Grade, go to cell P7 and type =COUNTIF($D$2:$E$62,"B")
To calculate Boys Students with 'C' Grade, go to cell Q7 and type =COUNTIF($D$2:$E$62,"C")
To calculate Boys Students with 'D' Grade, go to cell R7 and type =COUNTIF($D$2:$E$62,"D")
To calculate the expected value Ei
Go to Cell N9 and type =N8/2
Go to Cell O9 and type =O8/2
Go to Cell P9 and type =P8/2
Go to Cell Q9 and type =Q8/2
Go to Cell R9 and type =R8/2
Go to Cell S6 and calculate total girl students =SUM(N6:R6)
Go to Cell S7 and calculate total boy students =SUM(N7:R7)

Now Calculate
Go to cell T6 and type
=SUM((N6-$N$9)^2/$N$9,(O6-$O$9)^2/$O$9,(P6-$P$9)^2/$P$9,(Q6-$Q$9)^2/$Q$9,(R6-$R$9)^2/$R$9)
Go to cell T7 and type
=SUM((N7-$N$9)^2/$N$9,(O7-$O$9)^2/$O$9,(P7-$P$9)^2/$P$9,(Q7-$Q$9)^2/$Q$9,(R7-$R$9)^2/$R$9)
To get the table value, go to cell T11 and type =CHIINV(0.05,4)
Go to cell O13 and type =IF(T8>=T11,"H0 is Rejected","H0 is Accepted")
OUTPUT:
USING EXCEL:
USING PYTHON:
import numpy as np
import pandas as pd
import scipy.stats as stats
np.random.seed(10)
stud_grade = np.random.choice(a=["O","A","B","C","D"], p=[0.20, 0.20, 0.20, 0.20, 0.20], size=100)
stud_gen = np.random.choice(a=["Male","Female"], p=[0.5, 0.5], size=100)
mscpart1 = pd.DataFrame({"Grades": stud_grade, "Gender": stud_gen})
print(mscpart1)
stud_tab = pd.crosstab(mscpart1.Grades, mscpart1.Gender, margins=True)
# crosstab sorts labels alphabetically, so relabel in that order
stud_tab.columns = ["Female", "Male", "row_totals"]
stud_tab.index = ["A", "B", "C", "D", "O", "col_totals"]
observed = stud_tab.iloc[0:5, 0:2]
print(observed)
expected = np.outer(stud_tab["row_totals"][0:5], stud_tab.loc["col_totals"][0:2]) / 100
print(expected)
chi_squared_stat = (((observed - expected)**2)/expected).sum().sum()
print('Calculated : ', chi_squared_stat)
crit = stats.chi2.ppf(q=0.95, df=4)  # df = (5-1)*(2-1) = 4
print('Table Value : ', crit)
if chi_squared_stat <= crit:
    print('H0 is Accepted ')
else:
    print('H0 is Rejected ')
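Alternatively, SciPy can carry out the whole test of independence in one call with stats.chi2_contingency, which returns the statistic, the p-value, the degrees of freedom, and the expected counts. A minimal sketch with a hypothetical contingency table:

import numpy as np
import scipy.stats as stats
# Hypothetical 5x2 table of grade counts (rows O, A, B, C, D; columns Male, Female)
observed = np.array([[8, 12], [10, 9], [11, 10], [9, 11], [12, 8]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print('Chi-Squared:', chi2, 'p-value:', p, 'dof:', dof)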

OUTPUT:
Practical No: 5
Perform testing of hypothesis using Z-test.

Use a Z test if:


• Your sample size is greater than 30. Otherwise, use a t test.
• Data points should be independent from each other. In other words, one data point isn’t related or
doesn’t affect another data point.
• Your data should be normally distributed. However, for large sample sizes (over 30) this doesn’t always
matter.
• Your data should be randomly selected from a population, where each item has an equal chance of
being selected.
• Sample sizes should be equal if at all possible.
H0 - Blood pressure has a mean of 156 units.
Program Code for one-sample Z test.

import pandas as pd
from statsmodels.stats import weightstats as stests
df = pd.read_csv("blood_pressure.csv")
print(df[['bp_before','bp_after']].describe())
print(df)
ztest, pval = stests.ztest(df['bp_before'], x2=None, value=156)
print(float(pval))
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")

Output:
Two-sample Z test - In a two-sample z-test, similar to the t-test, we check two independent data
groups and decide whether the sample means of the two groups are equal or not.

H0 : the difference between the means of the two groups is 0

H1 : the difference between the means of the two groups is not 0

## Two-sample Z test
import pandas as pd
from statsmodels.stats import weightstats as stests
df = pd.read_csv("blood_pressure.csv")
print(df[['bp_before','bp_after']].describe())
print(df)
ztest, pval = stests.ztest(df['bp_before'], x2=df['bp_after'], value=0, alternative='two-sided')
print(float(pval))
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")

Output:

Practical No: 6
A. Perform testing of hypothesis using One-way ANOVA.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
data = pd.read_csv("C:/Users/User-04/Documents/Dimple_Baroliya/RIC/prac6/scores.csv")
data.head()
data['Borough'].value_counts()
# There is no total score column, so create it.
# In addition, find the mean score of each district across all schools.
data['total_score'] = data['Average Score (SAT Reading)'] + \
                      data['Average Score (SAT Math)'] + \
                      data['Average Score (SAT Writing)']
data = data[['Borough', 'total_score']].dropna()
x = ['Brooklyn', 'Bronx', 'Manhattan', 'Queens', 'Staten Island']
district_dict = {}
# Assign each district's test score series to a dictionary key
for district in x:
    district_dict[district] = data[data['Borough'] == district]['total_score']
y = []
yerror = []
# Assign the mean score and 95% confidence limit to each district
for district in x:
    y.append(district_dict[district].mean())
    yerror.append(1.96*district_dict[district].std()/np.sqrt(district_dict[district].shape[0]))
    print(district + '_std : {}'.format(district_dict[district].std()))
sns.set(font_scale=1.8)
fig = plt.figure(figsize=(10,5))
#ax = sns.barplot(x, y, yerr=yerror)
#ax.set_ylabel('Average Total SAT Score')
#plt.show()
# Perform 1-way ANOVA
print(stats.f_oneway(
    district_dict['Brooklyn'], district_dict['Bronx'],
    district_dict['Manhattan'], district_dict['Queens'],
    district_dict['Staten Island']
))
districts = ['Brooklyn', 'Bronx', 'Manhattan', 'Queens', 'Staten Island']
ss_b = 0
for d in districts:
    ss_b += district_dict[d].shape[0] * \
            np.sum((district_dict[d].mean() - data['total_score'].mean())**2)
ss_w = 0
for d in districts:
    ss_w += np.sum((district_dict[d] - district_dict[d].mean())**2)
msb = ss_b/4              # between-groups mean square, df = 5 groups - 1
msw = ss_w/(len(data)-5)  # within-groups mean square, df = N - 5
f = msb/msw
print('F_statistic: {}'.format(f))
ss_t = np.sum((data['total_score']-data['total_score'].mean())**2)
eta_squared = ss_b/ss_t
print('eta_squared: {}'.format(eta_squared))
Output:

B. Perform testing of hypothesis using Two-way ANOVA.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.graphics.factorplots import interaction_plot
import matplotlib.pyplot as plt
from scipy import stats

def eta_squared(aov):
    aov['eta_sq'] = 'NaN'
    aov['eta_sq'] = aov[:-1]['sum_sq']/sum(aov['sum_sq'])
    return aov

def omega_squared(aov):
    mse = aov['sum_sq'].iloc[-1]/aov['df'].iloc[-1]
    aov['omega_sq'] = 'NaN'
    aov['omega_sq'] = (aov[:-1]['sum_sq']-(aov[:-1]['df']*mse))/(sum(aov['sum_sq'])+mse)
    return aov

datafile = "ToothGrowth.csv"
data = pd.read_csv(datafile)
fig = interaction_plot(data.dose, data.supp, data.len,
                       colors=['red','blue'], markers=['D','^'], ms=10)
N = len(data.len)
df_a = len(data.supp.unique()) - 1
df_b = len(data.dose.unique()) - 1
df_axb = df_a*df_b
df_w = N - (len(data.supp.unique())*len(data.dose.unique()))
grand_mean = data['len'].mean()
# Sum of Squares A - supp
ssq_a = sum([(data[data.supp == l].len.mean()-grand_mean)**2 for l in data.supp])
# Sum of Squares B - dose
ssq_b = sum([(data[data.dose == l].len.mean()-grand_mean)**2 for l in data.dose])
# Sum of Squares Total
ssq_t = sum((data.len - grand_mean)**2)
vc = data[data.supp == 'VC']
oj = data[data.supp == 'OJ']
vc_dose_means = [vc[vc.dose == d].len.mean() for d in vc.dose]
oj_dose_means = [oj[oj.dose == d].len.mean() for d in oj.dose]
ssq_w = sum((oj.len - oj_dose_means)**2) + sum((vc.len - vc_dose_means)**2)
ssq_axb = ssq_t - ssq_a - ssq_b - ssq_w
ms_a = ssq_a/df_a        # Mean Square A
ms_b = ssq_b/df_b        # Mean Square B
ms_axb = ssq_axb/df_axb  # Mean Square AxB
ms_w = ssq_w/df_w
f_a = ms_a/ms_w
f_b = ms_b/ms_w
f_axb = ms_axb/ms_w
p_a = stats.f.sf(f_a, df_a, df_w)
p_b = stats.f.sf(f_b, df_b, df_w)
p_axb = stats.f.sf(f_axb, df_axb, df_w)
results = {'sum_sq':[ssq_a, ssq_b, ssq_axb, ssq_w],
           'df':[df_a, df_b, df_axb, df_w],
           'F':[f_a, f_b, f_axb, 'NaN'],
           'PR(>F)':[p_a, p_b, p_axb, 'NaN']}
columns = ['sum_sq', 'df', 'F', 'PR(>F)']
aov_table1 = pd.DataFrame(results, columns=columns,
                          index=['supp', 'dose', 'supp:dose', 'Residual'])
formula = 'len ~ C(supp) + C(dose) + C(supp):C(dose)'
model = ols(formula, data).fit()
aov_table = anova_lm(model, typ=2)
eta_squared(aov_table)
omega_squared(aov_table)
print(aov_table.round(4))
res = model.resid
fig = sm.qqplot(res, line='s')
plt.show()

Output:
C. Perform testing of hypothesis using MANOVA.

MANOVA is the acronym for Multivariate Analysis of Variance. When analyzing data, we may encounter
situations where we have multiple response variables (dependent variables). MANOVA also has some
assumptions, like ANOVA. Before performing MANOVA, we have to check whether the following
assumptions are satisfied:
• The samples, while drawing, should be independent of each other.
• The dependent variables are continuous in nature and the independent variables are categorical.
• The dependent variables should follow a multivariate normal distribution.
• The population variance-covariance matrices of each group are the same, i.e. the groups are homogeneous.
Code:
import pandas as pd
from statsmodels.multivariate.manova import MANOVA
df = pd.read_csv('C:/Users/User-04/Documents/Dimple_Baroliya/RIC/prac6/iris.csv', index_col=0)
df.columns = df.columns.str.replace(".", "_", regex=False)
df.head()
print('~~~~~~~~ Data Set ~~~~~~~~')
print(df)
maov = MANOVA.from_formula('Sepal_Length + Sepal_Width + Petal_Length + Petal_Width ~ Species', data=df)
print('~~~~~~~~ MANOVA Test Result ~~~~~~~~')
print(maov.mv_test())
Output:

Practical No: 7
A. Perform the Random sampling for the given data and analyse it.
Example 1: From a population of 10 women and 10 men as given in the table in Figure 1 on the
left below, create a random sample of 6 people for Group 1 and a periodic sample consisting of
every 3rd woman for Group 2.
You need to run the sampling data analysis tool twice, once to create Group 1 and again to create
Group 2. For Group 1 you select all 20 population cells as the Input Range and Random as the
Sampling Method with 6 for the Random Number of Samples. For Group 2 you select the 10
cells in the Women column as Input Range and Periodic with Period 3.
Open an existing Excel sheet with the population data. The sample sheet looks as given below.
Set Cell O1 = Male and Cell P1 = Female
To generate a random sample of male students from the given population, go to cell O2 and type
=INDEX(E$2:E$62,RANK(B2,B$2:B$62))
Drag the formula down to the desired number of cells to select the random sample.
Now, to generate a random sample of female students, go to cell P2 and type
=INDEX(K$2:K$40,RANK(H2,H$2:H$40))
Drag the formula down to the desired number of cells to select the random sample.
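The same two samples can also be drawn in Python with pandas. A minimal sketch, assuming a hypothetical population table with 'Men' and 'Women' columns:

import pandas as pd
# Hypothetical population of 10 men and 10 women
population = pd.DataFrame({'Men': range(1, 11), 'Women': range(11, 21)})
# Group 1: random sample of 6 people from all 20
group1 = population.melt(value_name='Person')['Person'].sample(n=6, random_state=1)
print(group1.tolist())
# Group 2: periodic sample of every 3rd woman
group2 = population['Women'].iloc[2::3]
print(group2.tolist())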

Output:
B. Perform the Stratified sampling for the given data and analyse it.
Suppose we are to carry out a hypothetical housing quality survey across Lagos State, Nigeria, looking at a
total of 5000 houses (hypothetically). We don't just go to one local government and select 5000 houses;
rather, we ensure that the 5000 houses are representative of all 20 local government areas Lagos State
comprises. This is called stratified sampling. The population is divided into homogeneous strata, and the
right number of instances is sampled from each stratum to guarantee that the test set (which in this case
is the 5000 houses) is representative of the overall population. If we used purely random sampling, there
would be a significant chance of bias in the survey results.

Command: C:\Users\User-04\AppData\Local\Programs\Python\Python39\Scripts>pip install scikit-learn

Program Code:

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import sklearn
from sklearn.model_selection import train_test_split
housing = pd.read_csv('C:/Users/User-04/Documents/Dimple_Baroliya/RIC/prac7/housing.csv')
print(housing.head())
print(housing.info())
#creating a heatmap of the attributes in the dataset
correlation_matrix = housing.corr()
plt.subplots(figsize=(8,6))
sns.heatmap(correlation_matrix, center=0, annot=True, linewidths=.3)
corr = housing.corr()
print(corr['median_house_value'].sort_values(ascending=False))
sns.distplot(housing.median_income)
plt.show()
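Note that the code above only explores the data; the stratified split itself can be done with scikit-learn. Continuing from the housing DataFrame loaded above, a minimal sketch (the income bins used for stratification are an assumption):

from sklearn.model_selection import StratifiedShuffleSplit
# Bin median_income into categories to stratify on
housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
# Each stratum appears in the test set in roughly its population proportion
print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))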

Output:
Practical No: 8
Write a program for computing different correlations.

Positive Correlation: Let’s take a look at a positive correlation. Numpy implements a corrcoef() function
that returns a matrix of correlations of x with x, x with y, y with x and y with y. We’re interested in the
values of correlation of x with y (so position (1, 0) or (0, 1)).

Code:

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)
# Positive Correlation with some noise
y = x + np.random.normal(0, 10, 1000)
print(np.corrcoef(x, y))
#matplotlib.style.use('ggplot')
plt.scatter(x, y)
plt.show()
Output:

Negative Correlation:

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)
# Negative Correlation with some noise
y = 100 - x + np.random.normal(0, 5, 1000)
print(np.corrcoef(x, y))
plt.scatter(x, y)
plt.show()

Output:
No/Weak Correlation:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
x = np.random.randint(0, 50, 1000)
y = np.random.randint(0, 50, 1000)
print(np.corrcoef(x, y))
plt.scatter(x, y)
plt.show()

Output:
Practical No: 9
A. Write a program to Perform linear regression for prediction.

## Practical 9 A: Write a program to perform linear regression for prediction.
import math
import datetime
import quandl
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
df = quandl.get("WIKI/GOOGL")
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)
X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
df.dropna(inplace=True)
y = np.array(df['label'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)
forecast_set = clf.predict(X_lately)
df['Forecast'] = np.nan
last_date = df.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day
for i in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)] + [i]
df['Adj. Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

Output:
B. Perform polynomial regression for prediction.

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} b_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
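Note that estimate_coef above fits a straight line, i.e. a degree-1 polynomial. For an actual polynomial fit, numpy's polyfit can be used; a minimal sketch on the same data, where the choice of degree 2 is an assumption:

import numpy as np
import matplotlib.pyplot as plt
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
coeffs = np.polyfit(x, y, deg=2)  # least-squares fit of a degree-2 polynomial
poly = np.poly1d(coeffs)          # callable polynomial built from the coefficients
xs = np.linspace(x.min(), x.max(), 100)
plt.scatter(x, y, color='m')
plt.plot(xs, poly(xs), color='g')
plt.xlabel('x')
plt.ylabel('y')
plt.show()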

Output:
Practical No: 10
A. Write a program for multiple linear regression analysis.

Step #1: Data Pre-Processing
a) Importing the Libraries.
b) Importing the Data Set.
c) Encoding the Categorical Data.
d) Avoiding the Dummy Variable Trap.
e) Splitting the Data Set into Training Set and Test Set.
Step #2: Fitting Multiple Linear Regression to the Training Set
Step #3: Predicting the Test Set results
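The program below implements multiple linear regression from scratch with gradient descent on a generated dataset. For reference, the steps listed above map onto scikit-learn as in this minimal sketch; the file and column names ('50_Startups.csv', 'Profit') are hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('50_Startups.csv')               # b) hypothetical data set
X = pd.get_dummies(df.drop('Profit', axis=1),     # c) encode categorical data
                   drop_first=True)               # d) avoid the dummy variable trap
y = df['Profit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # e) split
model = LinearRegression().fit(X_train, y_train)  # Step 2: fit on the training set
print(model.predict(X_test)[:5])                  # Step 3: predict the test set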

import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

def generate_dataset(n):
    x = []
    y = []
    random_x1 = np.random.rand()
    random_x2 = np.random.rand()
    for i in range(n):
        x1 = i
        x2 = i/2 + np.random.rand()*n
        x.append([1, x1, x2])
        y.append(random_x1 * x1 + random_x2 * x2 + 1)
    return np.array(x), np.array(y)

x, y = generate_dataset(200)
mpl.rcParams['legend.fontsize'] = 12
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x[:, 1], x[:, 2], y, label='y', s=5)
ax.legend()
ax.view_init(45, 0)
plt.show()

def mse(coef, x, y):
    return np.mean((np.dot(x, coef) - y)**2)/2

def gradients(coef, x, y):
    return np.mean(x.transpose()*(np.dot(x, coef) - y), axis=1)

def multilinear_regression(coef, x, y, lr, b1=0.9, b2=0.999, epsilon=1e-8):
    prev_error = 0
    m_coef = np.zeros(coef.shape)
    v_coef = np.zeros(coef.shape)
    moment_m_coef = np.zeros(coef.shape)
    moment_v_coef = np.zeros(coef.shape)
    t = 0
    while True:
        error = mse(coef, x, y)
        if abs(error - prev_error) <= epsilon:
            break
        prev_error = error
        grad = gradients(coef, x, y)
        t += 1
        # Adam-style moment estimates with bias correction
        m_coef = b1 * m_coef + (1-b1)*grad
        v_coef = b2 * v_coef + (1-b2)*grad**2
        moment_m_coef = m_coef / (1-b1**t)
        moment_v_coef = v_coef / (1-b2**t)
        delta = ((lr / moment_v_coef**0.5 + 1e-8) *
                 (b1 * moment_m_coef + (1-b1)*grad/(1-b1**t)))
        coef = np.subtract(coef, delta)
    return coef

coef = np.array([0, 0, 0])
c = multilinear_regression(coef, x, y, 1e-1)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x[:, 1], x[:, 2], y, label='y', s=5, color="dodgerblue")
ax.scatter(x[:, 1], x[:, 2], c[0] + c[1]*x[:, 1] + c[2]*x[:, 2],
           label='regression', s=5, color="orange")
ax.view_init(45, 0)
ax.legend()
plt.show()

Output:

B. Perform logistic regression analysis.

import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import metrics
matplotlib.style.use('ggplot')
plt.figure(figsize=(9,9))

def sigmoid(t):  # Define the sigmoid function
    return (1/(1 + np.e**(-t)))

plot_range = np.arange(-6, 6, 0.1)
y_values = sigmoid(plot_range)
# Plot curve
plt.plot(plot_range,  # X-axis range
         y_values,    # Predicted values
         color="red")
titanic_train = pd.read_csv("C:/Users/SONY/AppData/Local/Programs/Python/Python37-32/titanic_train.csv")  # Read the data
char_cabin = titanic_train["Cabin"].astype(str)           # Convert cabin to str
new_Cabin = np.array([cabin[0] for cabin in char_cabin])  # Take first letter
titanic_train["Cabin"] = pd.Categorical(new_Cabin)        # Save the new cabin var
# Impute median Age for NA Age values
new_age_var = np.where(titanic_train["Age"].isnull(),  # Logical check
                       28,                             # Value if check is true
                       titanic_train["Age"])           # Value if check is false
titanic_train["Age"] = new_age_var
label_encoder = preprocessing.LabelEncoder()
# Convert Sex variable to numeric
encoded_sex = label_encoder.fit_transform(titanic_train["Sex"])
# Initialize logistic regression model
log_model = linear_model.LogisticRegression()
# Train the model
log_model.fit(X=pd.DataFrame(encoded_sex),
              y=titanic_train["Survived"])
# Check trained model intercept
print(log_model.intercept_)
# Check trained model coefficients
print(log_model.coef_)
# Make predictions
preds = log_model.predict_proba(X=pd.DataFrame(encoded_sex))
preds = pd.DataFrame(preds)
preds.columns = ["Death_prob", "Survival_prob"]
# Generate table of predictions vs Sex
print(pd.crosstab(titanic_train["Sex"], preds.loc[:, "Survival_prob"]))
# Convert more variables to numeric
encoded_class = label_encoder.fit_transform(titanic_train["Pclass"])
encoded_cabin = label_encoder.fit_transform(titanic_train["Cabin"])
train_features = pd.DataFrame([encoded_class,
                               encoded_cabin,
                               encoded_sex,
                               titanic_train["Age"]]).T
# Initialize logistic regression model
log_model = linear_model.LogisticRegression()
# Train the model
log_model.fit(X=train_features,
              y=titanic_train["Survived"])
# Check trained model intercept
print(log_model.intercept_)
# Check trained model coefficients
print(log_model.coef_)
# Make predictions
preds = log_model.predict(X=train_features)
# Generate table of predictions vs actual
print(pd.crosstab(preds, titanic_train["Survived"]))
print(log_model.score(X=train_features,
                      y=titanic_train["Survived"]))
print(metrics.confusion_matrix(y_true=titanic_train["Survived"],  # True labels
                               y_pred=preds))                     # Predicted labels
# View summary of common classification metrics
print(metrics.classification_report(y_true=titanic_train["Survived"],
                                    y_pred=preds))
# Read and prepare test data
titanic_test = pd.read_csv("C:/Users/SONY/AppData/Local/Programs/Python/Python37-32/titanic_test.csv")  # Read the data
char_cabin = titanic_test["Cabin"].astype(str)            # Convert cabin to str
new_Cabin = np.array([cabin[0] for cabin in char_cabin])  # Take first letter
titanic_test["Cabin"] = pd.Categorical(new_Cabin)         # Save the new cabin var
# Impute median Age for NA Age values
new_age_var = np.where(titanic_test["Age"].isnull(),  # Logical check
                       28,                            # Value if check is true
                       titanic_test["Age"])           # Value if check is false
titanic_test["Age"] = new_age_var
# Convert test variables to match model features
encoded_sex = label_encoder.fit_transform(titanic_test["Sex"])
encoded_class = label_encoder.fit_transform(titanic_test["Pclass"])
encoded_cabin = label_encoder.fit_transform(titanic_test["Cabin"])
test_features = pd.DataFrame([encoded_class,
                              encoded_cabin,
                              encoded_sex,
                              titanic_test["Age"]]).T
# Make test set predictions
test_preds = log_model.predict(X=test_features)
# Create a submission for Kaggle
submission = pd.DataFrame({"PassengerId": titanic_test["PassengerId"],
                           "Survived": test_preds})
# Save submission to CSV
submission.to_csv("tutorial_logreg_submission.csv",
                  index=False)  # Do not save index values
Output:
