
Research in Computing

Practical No: 1
A) Write a program for obtaining descriptive statistics of data.

code:

import pandas as pd
#Create a Dictionary of series
d={'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}
#Create a DataFrame
df=pd.DataFrame(d)
print(df)
print("----------Sum-----------")
print(df.sum())
print("----------Mean----------")
print(df.mean())
print("---------------Standard Deviation-------")
print(df.std())
print("---------------Descriptive Statistics------------")
print(df.describe())

Output:
B) Write a program to import data from MySQL.
code:

import mysql.connector
conn = mysql.connector.connect(host='localhost', database='information_schema', user='root', password='root')
if conn.is_connected():
    print('###### Connection With MySQL Established Successfully ##### ')
else:
    print('Not Connected -- Check Connection Properties')

mycursor = conn.cursor()
mycursor.execute("show tables;")
myresult = mycursor.fetchall()
for x in myresult:
    print(x)
Output:
Microsoft Excel
################## Retrieve-Country-Currency.py
Code:
import os
import pandas as pd
################################################################
Base = 'C:/VKHCG'
################################################################
sFileDir = Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
CurrencyRawData = pd.read_excel('C:/VKHCG/01-Vermeulen/00-RawData/Country_Currency.xlsx')
sColumns = ['Country or territory', 'Currency', 'ISO-4217']
CurrencyData = CurrencyRawData[sColumns].copy()
CurrencyData.rename(columns={'Country or territory': 'Country', 'ISO-4217': 'CurrencyCode'}, inplace=True)
CurrencyData.dropna(subset=['Currency'], inplace=True)
CurrencyData['Country'] = CurrencyData['Country'].map(lambda x: x.strip())
CurrencyData['Currency'] = CurrencyData['Currency'].map(lambda x: x.strip())
CurrencyData['CurrencyCode'] = CurrencyData['CurrencyCode'].map(lambda x: x.strip())
print(CurrencyData)
print('~~~~~~ Data from Excel Sheet Retrieved Successfully ~~~~~~~ ')
################################################################
sFileName = sFileDir + '/Retrieve-Country-Currency.csv'
CurrencyData.to_csv(sFileName, index=False)
Output:
Practical No: 2

A popular electronics store wants to conduct a survey to establish baseline estimates of branded-laptop
awareness and to determine the popularity of different companies' laptops. The survey suggests steps to be
initiated or strengthened to address demand in a region. The key indicators are awareness among the
general population, demand for branded laptops, and problem users.
The objectives of this particular study are:
1. To know the preferences for different types of branded laptops among students and professionals.
2. To study which factors influence the choice of different types of branded laptops.
3. To know the level of satisfaction towards different types of branded laptops.
4. To identify the perception of consumers towards the laptop positioning strategy.
5. To know consumer preferences towards laptops in the present era.
Use the collected data for analysis.
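As a starting point, the collected responses can be summarized in Python. The following is a minimal sketch with hypothetical data; the column names and values are assumptions, and in practice the responses would be loaded from the survey file:

import pandas as pd
# Hypothetical survey responses (replace with the collected data)
survey = pd.DataFrame({
    'Occupation': ['Student', 'Professional', 'Student', 'Student', 'Professional'],
    'Brand': ['Dell', 'HP', 'Lenovo', 'Dell', 'Apple'],
    'Satisfaction': [4, 3, 5, 4, 5]  # rating from 1 (low) to 5 (high)
})
# Brand preference by occupation (objective 1)
print(pd.crosstab(survey['Occupation'], survey['Brand']))
# Average satisfaction per brand (objective 3)
print(survey.groupby('Brand')['Satisfaction'].mean())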

B. Perform analysis of given secondary data.


Steps in Secondary Data Analysis
1. Determine your research question – Knowing exactly what you are looking for.
2. Locating data – Knowing what is out there and whether you can gain access to it. A quick Internet
search, possibly with the help of a librarian, will reveal a wealth of options.
3. Evaluating relevance of the data – Considering things like the data’s original purpose, when it was
collected, population, sampling strategy/sample, data collection protocols, operationalization of concepts,
questions asked, and form/shape of the data.
4. Assessing credibility of the data – Establishing the credentials of the original researchers, searching
for full explication of methods including any problems encountered, determining how consistent the data
is with data from other sources, and discovering whether the data has been used in any credible published
research.
5. Analysis – This will generally involve a range of statistical processes.
Example: Analyze the given Population Census Data for Planning and Decision
Making by using the size and composition of populations.
Practical No: 3
A) Perform testing of hypothesis using one sample t-test.
Program Code:
from scipy.stats import ttest_1samp
import numpy as np
ages = np.genfromtxt('ages.csv')
print(ages)
ages_mean = np.mean(ages)
print(ages_mean)
tstat, pval = ttest_1samp(ages, 30)
print('p-value:', pval)
if pval < 0.05:  # alpha value is 0.05
    print("we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")

Output:
B) Write a program for t-test comparing two means for independent samples.
Program code:
import numpy as np
from scipy import stats
from numpy.random import randn
N = 100
#a=[35,40,12,15,21,14,46,10,28,48,16,30,32,48,31,22,12,39,19,25]
#b=[2,27,31,38,1,19,1,34,3,1,2,1,3,1,2,1,3,29,37,2]
a = 5*randn(N) + 50
b = 5*randn(N) + 51
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)
s = np.sqrt((var_a + var_b)/2)  # pooled standard deviation (equal sample sizes)
t = (a.mean() - b.mean())/(s*np.sqrt(2/N))
df = 2*N - 2
p = 2*(1 - stats.t.cdf(abs(t), df=df))  # two-tailed p-value
print("t=" + str(t))
print("p=" + str(p))
if p < 0.05:
    print('means of the two distributions are different and significant')
else:
    print('means of the two distributions are the same and not significant')

Output:
C) Perform testing of hypothesis using paired t-test.
The paired sample t-test is also called the dependent sample t-test. It is a univariate test that tests for a
significant difference between two related variables. An example of this is if you were to collect the blood
pressure of an individual before and after some treatment, condition, or time point. The data set
contains blood pressure readings before and after an intervention. These are the variables "bp_before" and
"bp_after". The hypotheses being tested are:
• H0 - The mean difference between sample 1 and sample 2 is equal to 0.
• HA - The mean difference between sample 1 and sample 2 is not equal to 0.

Program Code:
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("C:/Users/User-04/Documents/Dimple_Baroliya/RIC/prac3/blood_pressure.csv")
print(df[['bp_before','bp_after']].describe())
# First let's check for any significant outliers in each of the variables.
df[['bp_before', 'bp_after']].plot(kind='box')
# This saves the plot as a png file
plt.savefig('boxplot_outliers.png')
# Make a histogram of the differences between the two scores.
df['bp_difference'] = df['bp_before'] - df['bp_after']
df['bp_difference'].plot(kind='hist', title='Blood Pressure Difference Histogram')
# Again, this saves the plot as a png file
plt.savefig('blood pressure difference histogram.png')
stats.probplot(df['bp_difference'], plot=plt)
plt.title('Blood pressure Difference Q-Q Plot')
plt.savefig('blood pressure difference qq plot.png')
print(stats.shapiro(df['bp_difference']))  # normality check on the differences
print(stats.ttest_rel(df['bp_before'], df['bp_after']))  # paired t-test

Output:

Practical No: 4
A) Perform testing of hypothesis using chi-squared goodness-of-fit test.

A system administrator needs to upgrade the computers for his division. He wants to know what sort
of computer system his workers prefer. He gives three choices: Windows, Mac, or Linux. Test the
hypothesis or theory that an equal percentage of the population prefers each type of computer
system.
H0 : The population distribution of the variable is the same as the proposed distribution
HA : The distributions are different
To calculate the Chi-Squared value for Windows, go to cell D2 and type =((B2-C2)*(B2-C2))/C2
To calculate the Chi-Squared value for Mac, go to cell D3 and type =((B3-C3)*(B3-C3))/C3
To calculate the Chi-Squared value for Linux, go to cell D4 and type =((B4-C4)*(B4-C4))/C4
Go to cell D5 and type =SUM(D2:D4)
To get the table value for Chi-Square for α = 0.05 and dof = 2, go to cell D7 and type
=CHIINV(0.05,2) At cell D8 type =IF(D5>D7,"H0 Rejected","H0 Accepted")
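The same goodness-of-fit test can also be run in Python. The sketch below is illustrative only; the observed counts are hypothetical:

from scipy.stats import chisquare
# Hypothetical observed preference counts for Windows, Mac and Linux
observed = [45, 30, 25]
expected = [100/3, 100/3, 100/3]  # equal preference under H0
stat, pval = chisquare(f_obs=observed, f_exp=expected)
print('Chi-Squared statistic:', stat)
print('p-value:', pval)
if pval < 0.05:
    print('H0 is Rejected')
else:
    print('H0 is Accepted')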

Output:

B) Perform testing of hypothesis using chi-squared test of independence.


In a study to understand the performance of the M. Sc. IT Part-1 class, a college selects a random sample of
100 students. Each student was asked for the grade obtained in B. Sc. IT. The sample is as given below.

Output:
Null Hypothesis - H0 : The performance of girl students is the same as that of boy students.
Alternate Hypothesis - H1 : The performance of boy and girl students is different.
Open Excel Workbook
Prepare a contingency table as shown above.
To calculate Girls Students with 'O' Grade, go to cell N6 and type =COUNTIF($J$2:$K$40,"O")
To calculate Girls Students with 'A' Grade, go to cell O6 and type =COUNTIF($J$2:$K$40,"A")
To calculate Girls Students with 'B' Grade, go to cell P6 and type =COUNTIF($J$2:$K$40,"B")
To calculate Girls Students with 'C' Grade, go to cell Q6 and type =COUNTIF($J$2:$K$40,"C")
To calculate Girls Students with 'D' Grade, go to cell R6 and type =COUNTIF($J$2:$K$40,"D")
To calculate Boys Students with 'O' Grade, go to cell N7 and type =COUNTIF($D$2:$E$62,"O")
To calculate Boys Students with 'A' Grade, go to cell O7 and type =COUNTIF($D$2:$E$62,"A")
To calculate Boys Students with 'B' Grade, go to cell P7 and type =COUNTIF($D$2:$E$62,"B")
To calculate Boys Students with 'C' Grade, go to cell Q7 and type =COUNTIF($D$2:$E$62,"C")
To calculate Boys Students with 'D' Grade, go to cell R7 and type =COUNTIF($D$2:$E$62,"D")
To calculate the expected value Ei
Go to Cell N9 and type =N8/2
Go to Cell O9 and type =O8/2
Go to Cell P9 and type =P8/2
Go to Cell Q9 and type =Q8/2
Go to Cell R9 and type =R8/2
Go to Cell S6 and calculate total girl students =SUM(N6:R6)
Go to Cell S7 and calculate total boy students =SUM(N7:R7)

Now Calculate
Go to cell T6 and type
=SUM((N6-$N$9)^2/$N$9,(O6-$O$9)^2/$O$9,(P6-$P$9)^2/$P$9,(Q6-$Q$9)^2/$Q$9,(R6-$R$9)^2/$R$9)
Go to cell T7 and type
=SUM((N7-$N$9)^2/$N$9,(O7-$O$9)^2/$O$9,(P7-$P$9)^2/$P$9,(Q7-$Q$9)^2/$Q$9,(R7-$R$9)^2/$R$9)
To get the table value, go to cell T11 and type =CHIINV(0.05,4)
Go to cell O13 and type =IF(T8>=T11,"H0 is Rejected","H0 is Accepted")
OUTPUT:
USING EXCEL:
USING PYTHON:
import numpy as np
import pandas as pd
import scipy.stats as stats
np.random.seed(10)
stud_grade = np.random.choice(a=["O","A","B","C","D"], p=[0.20, 0.20, 0.20, 0.20, 0.20], size=100)
stud_gen = np.random.choice(a=["Male","Female"], p=[0.5, 0.5], size=100)
mscpart1 = pd.DataFrame({"Grades": stud_grade, "Gender": stud_gen})
print(mscpart1)
stud_tab = pd.crosstab(mscpart1.Grades, mscpart1.Gender, margins=True)
# crosstab sorts labels alphabetically, so relabel in that order
stud_tab.columns = ["Female", "Male", "row_totals"]
stud_tab.index = ["A", "B", "C", "D", "O", "col_totals"]
observed = stud_tab.iloc[0:5, 0:2]
print(observed)
expected = np.outer(stud_tab["row_totals"][0:5], stud_tab.loc["col_totals"][0:2]) / 100
print(expected)
chi_squared_stat = (((observed - expected)**2)/expected).sum().sum()
print('Calculated : ', chi_squared_stat)
crit = stats.chi2.ppf(q=0.95, df=4)  # df = (5-1)*(2-1) = 4
print('Table Value : ', crit)
if chi_squared_stat <= crit:
    print('H0 is Accepted ')
else:
    print('H0 is Rejected ')
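Alternatively, SciPy can carry out the whole test of independence in one call with stats.chi2_contingency, which returns the statistic, the p-value, the degrees of freedom, and the expected counts. A minimal sketch with a hypothetical contingency table:

import numpy as np
import scipy.stats as stats
# Hypothetical 5x2 table of grade counts (rows O, A, B, C, D; columns Male, Female)
observed = np.array([[8, 12], [10, 9], [11, 10], [9, 11], [12, 8]])
chi2, p, dof, expected = stats.chi2_contingency(observed)
print('Chi-Squared:', chi2, 'p-value:', p, 'dof:', dof)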

OUTPUT:
Practical No: 5
Perform testing of hypothesis using Z-test.

Use a Z test if:


• Your sample size is greater than 30. Otherwise, use a t test.
• Data points should be independent from each other. In other words, one data point isn’t related or
doesn’t affect another data point.
• Your data should be normally distributed. However, for large sample sizes (over 30) this doesn’t always
matter.
• Your data should be randomly selected from a population, where each item has an equal chance of
being selected.
• Sample sizes should be equal if at all possible.
H0 - Blood pressure has a mean of 156 units.
Program Code for one-sample Z test.

import pandas as pd
from statsmodels.stats import weightstats as stests
df = pd.read_csv("blood_pressure.csv")
print(df[['bp_before','bp_after']].describe())
print(df)
ztest, pval = stests.ztest(df['bp_before'], x2=None, value=156)
print(float(pval))
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")

Output:
Two-sample Z test - In a two-sample z-test, similar to the t-test, we check two independent data
groups and decide whether the sample means of the two groups are equal or not.

H0 : the difference between the means of the two groups is 0

H1 : the difference between the means of the two groups is not 0

## Two-sample Z test
import pandas as pd
from statsmodels.stats import weightstats as stests
df = pd.read_csv("blood_pressure.csv")
print(df[['bp_before','bp_after']].describe())
print(df)
ztest, pval = stests.ztest(df['bp_before'], x2=df['bp_after'], value=0, alternative='two-sided')
print(float(pval))
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("accept null hypothesis")

Output:

Practical No: 6
A. Perform testing of hypothesis using One-way ANOVA.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
data = pd.read_csv("C:/Users/User-04/Documents/Dimple_Baroliya/RIC/prac6/scores.csv")
data.head()
data['Borough'].value_counts()
# There is no total score column, so create it.
# In addition, find the mean score of each district across all schools.
data['total_score'] = data['Average Score (SAT Reading)'] + \
                      data['Average Score (SAT Math)'] + \
                      data['Average Score (SAT Writing)']
data = data[['Borough', 'total_score']].dropna()
x = ['Brooklyn', 'Bronx', 'Manhattan', 'Queens', 'Staten Island']
district_dict = {}
# Assign each district's test score series to a dictionary key
for district in x:
    district_dict[district] = data[data['Borough'] == district]['total_score']
y = []
yerror = []
# Assign the mean score and 95% confidence limit to each district
for district in x:
    y.append(district_dict[district].mean())
    yerror.append(1.96*district_dict[district].std()/np.sqrt(district_dict[district].shape[0]))
    print(district + '_std : {}'.format(district_dict[district].std()))
sns.set(font_scale=1.8)
fig = plt.figure(figsize=(10,5))
#ax = sns.barplot(x, y, yerr=yerror)
#ax.set_ylabel('Average Total SAT Score')
#plt.show()
# Perform 1-way ANOVA
print(stats.f_oneway(
    district_dict['Brooklyn'], district_dict['Bronx'],
    district_dict['Manhattan'], district_dict['Queens'],
    district_dict['Staten Island']
))
districts = ['Brooklyn', 'Bronx', 'Manhattan', 'Queens', 'Staten Island']
ss_b = 0
for d in districts:
    ss_b += district_dict[d].shape[0] * \
            np.sum((district_dict[d].mean() - data['total_score'].mean())**2)
ss_w = 0
for d in districts:
    ss_w += np.sum((district_dict[d] - district_dict[d].mean())**2)
msb = ss_b/4              # between-groups mean square, df = 5 groups - 1
msw = ss_w/(len(data)-5)  # within-groups mean square, df = N - 5
f = msb/msw
print('F_statistic: {}'.format(f))
ss_t = np.sum((data['total_score']-data['total_score'].mean())**2)
eta_squared = ss_b/ss_t
print('eta_squared: {}'.format(eta_squared))
Output:

B. Perform testing of hypothesis using Two-way ANOVA.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.graphics.factorplots import interaction_plot
import matplotlib.pyplot as plt
from scipy import stats

def eta_squared(aov):
    aov['eta_sq'] = 'NaN'
    aov['eta_sq'] = aov[:-1]['sum_sq']/sum(aov['sum_sq'])
    return aov

def omega_squared(aov):
    mse = aov['sum_sq'].iloc[-1]/aov['df'].iloc[-1]
    aov['omega_sq'] = 'NaN'
    aov['omega_sq'] = (aov[:-1]['sum_sq']-(aov[:-1]['df']*mse))/(sum(aov['sum_sq'])+mse)
    return aov

datafile = "ToothGrowth.csv"
data = pd.read_csv(datafile)
fig = interaction_plot(data.dose, data.supp, data.len,
                       colors=['red','blue'], markers=['D','^'], ms=10)
N = len(data.len)
df_a = len(data.supp.unique()) - 1
df_b = len(data.dose.unique()) - 1
df_axb = df_a*df_b
df_w = N - (len(data.supp.unique())*len(data.dose.unique()))
grand_mean = data['len'].mean()
# Sum of Squares A - supp
ssq_a = sum([(data[data.supp == l].len.mean()-grand_mean)**2 for l in data.supp])
# Sum of Squares B - dose
ssq_b = sum([(data[data.dose == l].len.mean()-grand_mean)**2 for l in data.dose])
# Sum of Squares Total
ssq_t = sum((data.len - grand_mean)**2)
vc = data[data.supp == 'VC']
oj = data[data.supp == 'OJ']
vc_dose_means = [vc[vc.dose == d].len.mean() for d in vc.dose]
oj_dose_means = [oj[oj.dose == d].len.mean() for d in oj.dose]
ssq_w = sum((oj.len - oj_dose_means)**2) + sum((vc.len - vc_dose_means)**2)
ssq_axb = ssq_t - ssq_a - ssq_b - ssq_w
ms_a = ssq_a/df_a        # Mean Square A
ms_b = ssq_b/df_b        # Mean Square B
ms_axb = ssq_axb/df_axb  # Mean Square AxB
ms_w = ssq_w/df_w
f_a = ms_a/ms_w
f_b = ms_b/ms_w
f_axb = ms_axb/ms_w
p_a = stats.f.sf(f_a, df_a, df_w)
p_b = stats.f.sf(f_b, df_b, df_w)
p_axb = stats.f.sf(f_axb, df_axb, df_w)
results = {'sum_sq':[ssq_a, ssq_b, ssq_axb, ssq_w],
           'df':[df_a, df_b, df_axb, df_w],
           'F':[f_a, f_b, f_axb, 'NaN'],
           'PR(>F)':[p_a, p_b, p_axb, 'NaN']}
columns = ['sum_sq', 'df', 'F', 'PR(>F)']
aov_table1 = pd.DataFrame(results, columns=columns,
                          index=['supp', 'dose', 'supp:dose', 'Residual'])
formula = 'len ~ C(supp) + C(dose) + C(supp):C(dose)'
model = ols(formula, data).fit()
aov_table = anova_lm(model, typ=2)
eta_squared(aov_table)
omega_squared(aov_table)
print(aov_table.round(4))
res = model.resid
fig = sm.qqplot(res, line='s')
plt.show()

Output:
C. Perform testing of hypothesis using MANOVA.

MANOVA is the acronym for Multivariate Analysis of Variance. When analyzing data, we may encounter
situations where we have multiple response variables (dependent variables). MANOVA also has some
assumptions, like ANOVA. Before performing MANOVA, we have to check whether the following
assumptions are satisfied:
• The samples, while drawing, should be independent of each other.
• The dependent variables are continuous in nature and the independent variables are categorical.
• The dependent variables should follow a multivariate normal distribution.
• The population variance-covariance matrices of each group are the same, i.e. the groups are homogeneous.
Code:
import pandas as pd
from statsmodels.multivariate.manova import MANOVA
df = pd.read_csv('C:/Users/User-04/Documents/Dimple_Baroliya/RIC/prac6/iris.csv', index_col=0)
df.columns = df.columns.str.replace(".", "_", regex=False)
df.head()
print('~~~~~~~~ Data Set ~~~~~~~~')
print(df)
maov = MANOVA.from_formula('Sepal_Length + Sepal_Width + Petal_Length + Petal_Width ~ Species', data=df)
print('~~~~~~~~ MANOVA Test Result ~~~~~~~~')
print(maov.mv_test())
Output:

Practical No: 7
A. Perform the Random sampling for the given data and analyse it.
Example 1: From a population of 10 women and 10 men as given in the table in Figure 1 on the
left below, create a random sample of 6 people for Group 1 and a periodic sample consisting of
every 3rd woman for Group 2.
You need to run the sampling data analysis tool twice, once to create Group 1 and again to create
Group 2. For Group 1 you select all 20 population cells as the Input Range and Random as the
Sampling Method with 6 for the Random Number of Samples. For Group 2 you select the 10
cells in the Women column as Input Range and Periodic with Period 3.
Open an existing Excel sheet with the population data. The sample sheet looks as given below.
Set Cell O1 = Male and Cell P1 = Female
To generate a random sample of male students from the given population, go to cell O2 and type
=INDEX(E$2:E$62,RANK(B2,B$2:B$62))
Drag the formula down to the desired number of cells to select the random sample.
Now, to generate a random sample of female students, go to cell P2 and type
=INDEX(K$2:K$40,RANK(H2,H$2:H$40))
Drag the formula down to the desired number of cells to select the random sample.
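The same two samples can also be drawn in Python with pandas. A minimal sketch, assuming a hypothetical population table with 'Men' and 'Women' columns:

import pandas as pd
# Hypothetical population of 10 men and 10 women
population = pd.DataFrame({'Men': range(1, 11), 'Women': range(11, 21)})
# Group 1: random sample of 6 people from all 20
group1 = population.melt(value_name='Person')['Person'].sample(n=6, random_state=1)
print(group1.tolist())
# Group 2: periodic sample of every 3rd woman
group2 = population['Women'].iloc[2::3]
print(group2.tolist())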

Output:
B. Perform the Stratified sampling for the given data and analyse it.
Suppose we are to carry out a hypothetical housing quality survey across Lagos State, Nigeria, looking at a
total of 5000 houses (hypothetically). We don't just go to one local government and select 5000 houses;
rather, we ensure that the 5000 houses are representative of all 20 local government areas Lagos State
comprises. This is called stratified sampling. The population is divided into homogeneous strata, and the
right number of instances is sampled from each stratum to guarantee that the test set (which in this case
is the 5000 houses) is representative of the overall population. If we used purely random sampling, there
would be a significant chance of bias in the survey results.

Command: C:\Users\User-04\AppData\Local\Programs\Python\Python39\Scripts>pip install scikit-learn

Program Code:

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import sklearn
from sklearn.model_selection import train_test_split
housing = pd.read_csv('C:/Users/User-04/Documents/Dimple_Baroliya/RIC/prac7/housing.csv')
print(housing.head())
print(housing.info())
#creating a heatmap of the attributes in the dataset
correlation_matrix = housing.corr()
plt.subplots(figsize=(8,6))
sns.heatmap(correlation_matrix, center=0, annot=True, linewidths=.3)
corr = housing.corr()
print(corr['median_house_value'].sort_values(ascending=False))
sns.distplot(housing.median_income)
plt.show()
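Note that the code above only explores the data; the stratified split itself can be done with scikit-learn. Continuing from the housing DataFrame loaded above, a minimal sketch (the income bins used for stratification are an assumption):

from sklearn.model_selection import StratifiedShuffleSplit
# Bin median_income into categories to stratify on
housing['income_cat'] = pd.cut(housing['median_income'],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing['income_cat']):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
# Each stratum appears in the test set in roughly its population proportion
print(strat_test_set['income_cat'].value_counts() / len(strat_test_set))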

Output:
Practical No: 8
Write a program for computing different correlations.

Positive Correlation: Let’s take a look at a positive correlation. Numpy implements a corrcoef() function
that returns a matrix of correlations of x with x, x with y, y with x and y with y. We’re interested in the
values of correlation of x with y (so position (1, 0) or (0, 1)).

Code:

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)
# Positive Correlation with some noise
y = x + np.random.normal(0, 10, 1000)
print(np.corrcoef(x, y))
#matplotlib.style.use('ggplot')
plt.scatter(x, y)
plt.show()
Output:

Negative Correlation:

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
# 1000 random integers between 0 and 50
x = np.random.randint(0, 50, 1000)
# Negative Correlation with some noise
y = 100 - x + np.random.normal(0, 5, 1000)
print(np.corrcoef(x, y))
plt.scatter(x, y)
plt.show()

Output:
No/Weak Correlation:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
x = np.random.randint(0, 50, 1000)
y = np.random.randint(0, 50, 1000)
print(np.corrcoef(x, y))
plt.scatter(x, y)
plt.show()

Output:
Practical No: 9
A. Write a program to Perform linear regression for prediction.

## Practical 9 A: Write a program to perform linear regression for prediction.
import math
import datetime
import quandl
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
df = quandl.get("WIKI/GOOGL")
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)
X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
df.dropna(inplace=True)
y = np.array(df['label'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)
forecast_set = clf.predict(X_lately)
df['Forecast'] = np.nan
last_date = df.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day
for i in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)] + [i]
df['Adj. Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

Output:
B. Perform polynomial regression for prediction.

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} b_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
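Note that estimate_coef above fits a straight line, i.e. a degree-1 polynomial. For an actual polynomial fit, numpy's polyfit can be used; a minimal sketch on the same data, where the choice of degree 2 is an assumption:

import numpy as np
import matplotlib.pyplot as plt
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
coeffs = np.polyfit(x, y, deg=2)  # least-squares fit of a degree-2 polynomial
poly = np.poly1d(coeffs)          # callable polynomial built from the coefficients
xs = np.linspace(x.min(), x.max(), 100)
plt.scatter(x, y, color='m')
plt.plot(xs, poly(xs), color='g')
plt.xlabel('x')
plt.ylabel('y')
plt.show()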

Output:
Practical No: 10
A. Write a program for multiple linear regression analysis.

Step #1: Data Pre-Processing
a) Importing the Libraries.
b) Importing the Data Set.
c) Encoding the Categorical Data.
d) Avoiding the Dummy Variable Trap.
e) Splitting the Data Set into Training Set and Test Set.
Step #2: Fitting Multiple Linear Regression to the Training Set
Step #3: Predicting the Test Set results
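The program below implements multiple linear regression from scratch with gradient descent on a generated dataset. For reference, the steps listed above map onto scikit-learn as in this minimal sketch; the file and column names ('50_Startups.csv', 'Profit') are hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('50_Startups.csv')               # b) hypothetical data set
X = pd.get_dummies(df.drop('Profit', axis=1),     # c) encode categorical data
                   drop_first=True)               # d) avoid the dummy variable trap
y = df['Profit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # e) split
model = LinearRegression().fit(X_train, y_train)  # Step 2: fit on the training set
print(model.predict(X_test)[:5])                  # Step 3: predict the test set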

import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

def generate_dataset(n):
    x = []
    y = []
    random_x1 = np.random.rand()
    random_x2 = np.random.rand()
    for i in range(n):
        x1 = i
        x2 = i/2 + np.random.rand()*n
        x.append([1, x1, x2])
        y.append(random_x1 * x1 + random_x2 * x2 + 1)
    return np.array(x), np.array(y)

x, y = generate_dataset(200)
mpl.rcParams['legend.fontsize'] = 12
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x[:, 1], x[:, 2], y, label='y', s=5)
ax.legend()
ax.view_init(45, 0)
plt.show()

def mse(coef, x, y):
    return np.mean((np.dot(x, coef) - y)**2)/2

def gradients(coef, x, y):
    return np.mean(x.transpose()*(np.dot(x, coef) - y), axis=1)

def multilinear_regression(coef, x, y, lr, b1=0.9, b2=0.999, epsilon=1e-8):
    prev_error = 0
    m_coef = np.zeros(coef.shape)
    v_coef = np.zeros(coef.shape)
    moment_m_coef = np.zeros(coef.shape)
    moment_v_coef = np.zeros(coef.shape)
    t = 0
    while True:
        error = mse(coef, x, y)
        if abs(error - prev_error) <= epsilon:
            break
        prev_error = error
        grad = gradients(coef, x, y)
        t += 1
        # Adam-style moment estimates with bias correction
        m_coef = b1 * m_coef + (1-b1)*grad
        v_coef = b2 * v_coef + (1-b2)*grad**2
        moment_m_coef = m_coef / (1-b1**t)
        moment_v_coef = v_coef / (1-b2**t)
        delta = ((lr / moment_v_coef**0.5 + 1e-8) *
                 (b1 * moment_m_coef + (1-b1)*grad/(1-b1**t)))
        coef = np.subtract(coef, delta)
    return coef

coef = np.array([0, 0, 0])
c = multilinear_regression(coef, x, y, 1e-1)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x[:, 1], x[:, 2], y, label='y', s=5, color="dodgerblue")
ax.scatter(x[:, 1], x[:, 2], c[0] + c[1]*x[:, 1] + c[2]*x[:, 2],
           label='regression', s=5, color="orange")
ax.view_init(45, 0)
ax.legend()
plt.show()

Output:

B. Perform logistic regression analysis.

import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import metrics
matplotlib.style.use('ggplot')
plt.figure(figsize=(9,9))

def sigmoid(t):  # Define the sigmoid function
    return (1/(1 + np.e**(-t)))

plot_range = np.arange(-6, 6, 0.1)
y_values = sigmoid(plot_range)
# Plot curve
plt.plot(plot_range,  # X-axis range
         y_values,    # Predicted values
         color="red")
titanic_train = pd.read_csv("C:/Users/SONY/AppData/Local/Programs/Python/Python37-32/titanic_train.csv")  # Read the data
char_cabin = titanic_train["Cabin"].astype(str)           # Convert cabin to str
new_Cabin = np.array([cabin[0] for cabin in char_cabin])  # Take first letter
titanic_train["Cabin"] = pd.Categorical(new_Cabin)        # Save the new cabin var
# Impute median Age for NA Age values
new_age_var = np.where(titanic_train["Age"].isnull(),  # Logical check
                       28,                             # Value if check is true
                       titanic_train["Age"])           # Value if check is false
titanic_train["Age"] = new_age_var
label_encoder = preprocessing.LabelEncoder()
# Convert Sex variable to numeric
encoded_sex = label_encoder.fit_transform(titanic_train["Sex"])
# Initialize logistic regression model
log_model = linear_model.LogisticRegression()
# Train the model
log_model.fit(X=pd.DataFrame(encoded_sex),
              y=titanic_train["Survived"])
# Check trained model intercept
print(log_model.intercept_)
# Check trained model coefficients
print(log_model.coef_)
# Make predictions
preds = log_model.predict_proba(X=pd.DataFrame(encoded_sex))
preds = pd.DataFrame(preds)
preds.columns = ["Death_prob", "Survival_prob"]
# Generate table of predictions vs Sex
print(pd.crosstab(titanic_train["Sex"], preds.loc[:, "Survival_prob"]))
# Convert more variables to numeric
encoded_class = label_encoder.fit_transform(titanic_train["Pclass"])
encoded_cabin = label_encoder.fit_transform(titanic_train["Cabin"])
train_features = pd.DataFrame([encoded_class,
                               encoded_cabin,
                               encoded_sex,
                               titanic_train["Age"]]).T
# Initialize logistic regression model
log_model = linear_model.LogisticRegression()
# Train the model
log_model.fit(X=train_features,
              y=titanic_train["Survived"])
# Check trained model intercept
print(log_model.intercept_)
# Check trained model coefficients
print(log_model.coef_)
# Make predictions
preds = log_model.predict(X=train_features)
# Generate table of predictions vs actual
print(pd.crosstab(preds, titanic_train["Survived"]))
print(log_model.score(X=train_features,
                      y=titanic_train["Survived"]))
print(metrics.confusion_matrix(y_true=titanic_train["Survived"],  # True labels
                               y_pred=preds))                     # Predicted labels
# View summary of common classification metrics
print(metrics.classification_report(y_true=titanic_train["Survived"],
                                    y_pred=preds))
# Read and prepare test data
titanic_test = pd.read_csv("C:/Users/SONY/AppData/Local/Programs/Python/Python37-32/titanic_test.csv")  # Read the data
char_cabin = titanic_test["Cabin"].astype(str)            # Convert cabin to str
new_Cabin = np.array([cabin[0] for cabin in char_cabin])  # Take first letter
titanic_test["Cabin"] = pd.Categorical(new_Cabin)         # Save the new cabin var
# Impute median Age for NA Age values
new_age_var = np.where(titanic_test["Age"].isnull(),  # Logical check
                       28,                            # Value if check is true
                       titanic_test["Age"])           # Value if check is false
titanic_test["Age"] = new_age_var
# Convert test variables to match model features
encoded_sex = label_encoder.fit_transform(titanic_test["Sex"])
encoded_class = label_encoder.fit_transform(titanic_test["Pclass"])
encoded_cabin = label_encoder.fit_transform(titanic_test["Cabin"])
test_features = pd.DataFrame([encoded_class,
                              encoded_cabin,
                              encoded_sex,
                              titanic_test["Age"]]).T
# Make test set predictions
test_preds = log_model.predict(X=test_features)
# Create a submission for Kaggle
submission = pd.DataFrame({"PassengerId": titanic_test["PassengerId"],
                           "Survived": test_preds})
# Save submission to CSV
submission.to_csv("tutorial_logreg_submission.csv",
                  index=False)  # Do not save index values
Output:
