
Regression Stat Assignment

Uploaded by

Pritom Das

#Statistics Assignment: Generating Regression Model

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
from scipy.optimize import curve_fit

These are the imports needed for the regression models:

• numpy for numerical array operations
• matplotlib for plotting the data and the fitted models
• scipy.stats for the straight-line (linear) regression via linregress()
• scipy.optimize for the nonlinear curve fit via curve_fit()

The dataset comes from http://www.statsci.org/data/general/brunhild.html. It records the
concentration of a sulfate in the blood of a baboon named Brunhilda as a function of time.
The data table is presented here:

Hours Sulfate
2 15.11
4 11.36
6 9.77
8 9.09
10 8.48
15 7.69
20 7.33
25 7.06
30 6.7
40 6.43
50 6.16
60 5.99
70 5.77
80 5.64
90 5.39
110 5.09
130 4.87
150 4.6
160 4.5
170 4.36
180 4.27

Let's represent the data table as two numpy arrays for the calculations that follow:

hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80,
90, 110, 130, 150, 160, 170, 180])
sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06,
6.7, 6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87, 4.6, 4.5, 4.36,
4.27])

#Task 1: Prepare a plot showing:

1. the data points and
2. the regression line in log-log coordinates.

To plot the data points in log-log coordinates, we first pass the hours and sulfate values to
the numpy.log() function, which returns the natural logarithm of each element.

log_hours = np.log(hours)
log_sulfate = np.log(sulfate)
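
numpy.log() computes the natural logarithm. As a small check (a sketch, not part of the original assignment), the slope of a log-log regression does not depend on the base of the logarithm, because changing the base rescales both axes by the same factor; only the intercept changes. The names fit_ln and fit_10 are illustrative.

```python
import numpy as np
from scipy.stats import linregress

hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80,
                  90, 110, 130, 150, 160, 170, 180], dtype=float)
sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06,
                    6.7, 6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87,
                    4.6, 4.5, 4.36, 4.27])

# Same regression, once with natural logs and once with base-10 logs.
fit_ln = linregress(np.log(hours), np.log(sulfate))
fit_10 = linregress(np.log10(hours), np.log10(sulfate))

# The slopes agree; the intercepts differ by a factor of ln(10).
print("slope (ln):   ", fit_ln.slope)
print("slope (log10):", fit_10.slope)
print("intercept (ln):            ", fit_ln.intercept)
print("intercept (log10) * ln(10):", fit_10.intercept * np.log(10))
```

So the choice between np.log() and np.log10() affects only the axis scale of the plot, not the fitted exponent.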

Now we plot the log-log data points.

plt.scatter(log_hours, log_sulfate)
plt.xlabel('Log(Hours)')
plt.ylabel('Log(Sulfate)')
plt.title('Log-Log Plot of Sulfate Concentration vs. Time')
plt.grid(True)

For the regression line we use the linregress() function from scipy.stats. It returns the
slope and the y-intercept of the fitted line, along with the correlation coefficient, the
p-value and the standard error of the slope, which we discard here.

slope, intercept, _, _, _ = linregress(log_hours, log_sulfate)
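
A straight line in log-log coordinates corresponds to a power law in the original coordinates: log(y) = slope * log(x) + intercept is equivalent to y = exp(intercept) * x**slope. The sketch below recovers that power law and cross-checks the linregress() result against numpy.polyfit(). The names power_law and prefactor are illustrative and do not appear in the original.

```python
import numpy as np
from scipy.stats import linregress

hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80,
                  90, 110, 130, 150, 160, 170, 180], dtype=float)
sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06,
                    6.7, 6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87,
                    4.6, 4.5, 4.36, 4.27])

log_hours = np.log(hours)
log_sulfate = np.log(sulfate)

fit = linregress(log_hours, log_sulfate)

# Back-transform the line: y = exp(intercept) * x**slope.
prefactor = np.exp(fit.intercept)


def power_law(x):
    """Log-log regression line expressed in the original coordinates."""
    return prefactor * x ** fit.slope


# A degree-1 least-squares polynomial fit should give the same line.
poly_slope, poly_intercept = np.polyfit(log_hours, log_sulfate, 1)

print("slope:", fit.slope, "polyfit slope:", poly_slope)
print("prefactor exp(intercept):", prefactor)
print("r squared:", fit.rvalue ** 2)
```

Calling power_law(hours) gives the fitted concentrations in the original units, which can be drawn over the raw data points if a curve in the original coordinates is wanted from this model.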

Then we plot the straight line over the data points:

plt.scatter(log_hours, log_sulfate)
plt.plot(log_hours, slope * log_hours + intercept, color='red',
label='Regression Line')
plt.xlabel('Log(Hours)')
plt.ylabel('Log(Sulfate)')
plt.title('Log-Log Plot of Sulfate Concentration vs. Time')
plt.grid(True)
plt.show()

#Task 2: Prepare a plot showing:

1. the data points and
2. the regression curve in the original coordinates.

First, plot the data points as is:


plt.scatter(hours, sulfate, label='Data Points')
plt.xlabel('Hours')
plt.ylabel('Sulfate')
plt.title('Plot of Sulfate Concentration vs. Time')
plt.grid(True)

For the regression curve we use the curve_fit() function from scipy.optimize. It requires us
to choose the form of the curve in advance, e.g. sinusoidal, linear (y = mx + c) or exponential.

In this dataset the points resemble a decaying exponential, so we assume the function to be

$$y = a e^{-bx} + c$$

This can be written in Python as:

def demo_exp_function(x, a, b, c):
    return a * np.exp(-b * x) + c

Now we pass this function and the data points to curve_fit(). It returns the fitted
constants (here a, b and c) together with their estimated covariance matrix, which we discard:

constants, _ = curve_fit(demo_exp_function, hours, sulfate)
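
When p0 is not given, curve_fit() starts its search from an initial guess of 1 for every parameter, and exponential models can be sensitive to that starting point. The following is a hedged sketch of passing an explicit starting guess and reading approximate standard errors from the returned covariance matrix. The p0 values are rough, eyeballed assumptions (an initial drop of about 10, a decay rate of about 0.1 per hour, a plateau near 5), not values from the original.

```python
import numpy as np
from scipy.optimize import curve_fit

hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80,
                  90, 110, 130, 150, 160, 170, 180], dtype=float)
sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06,
                    6.7, 6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87,
                    4.6, 4.5, 4.36, 4.27])


def demo_exp_function(x, a, b, c):
    return a * np.exp(-b * x) + c


# Assumed starting guess for (a, b, c); adjust after looking at the data.
p0 = (10.0, 0.1, 5.0)
constants, covariance = curve_fit(demo_exp_function, hours, sulfate, p0=p0)

# Square roots of the diagonal approximate one-sigma parameter errors.
std_errors = np.sqrt(np.diag(covariance))

# Compare the fitted curve with the trivial "predict the mean" model.
sse_fit = np.sum((sulfate - demo_exp_function(hours, *constants)) ** 2)
sse_mean = np.sum((sulfate - sulfate.mean()) ** 2)

for name, value, err in zip("abc", constants, std_errors):
    print(f"{name} = {value:.4f} +/- {err:.4f}")
print("sum of squared errors (fit): ", sse_fit)
print("sum of squared errors (mean):", sse_mean)
```

If the fitted constants change noticeably with different reasonable values of p0, the assumed curve shape is probably a poor match for the data.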

Plotting the regression curve:

plt.scatter(hours, sulfate, label='Data Points')


plt.plot(hours, demo_exp_function(hours, *constants), color='red',
label='Regression Curve')
plt.xlabel('Hours')
plt.ylabel('Sulfate')
plt.title('Plot of Sulfate Concentration vs. Time')
plt.grid(True)

#Task 3: Plot the residual against the fitted values in log-log and in original coordinates.

The residual is the difference between the observed value of the dependent variable (in this case,
the sulfate concentration) and the value predicted by the regression model. In other words, it
represents the error or deviation of each data point from the fitted regression line or curve.

regression_line = slope * log_hours + intercept
residual = log_sulfate - regression_line
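
For a least-squares line with an intercept, the residuals always sum to zero and are uncorrelated with the fitted values. Those two properties therefore say nothing about fit quality on their own; what matters in the residual plot is whether a pattern remains. A small sketch verifying both properties on this dataset:

```python
import numpy as np
from scipy.stats import linregress

hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80,
                  90, 110, 130, 150, 160, 170, 180], dtype=float)
sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06,
                    6.7, 6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87,
                    4.6, 4.5, 4.36, 4.27])

log_hours = np.log(hours)
log_sulfate = np.log(sulfate)

slope, intercept, _, _, _ = linregress(log_hours, log_sulfate)
regression_line = slope * log_hours + intercept
residual = log_sulfate - regression_line

# Both quantities are zero up to floating-point rounding.
print("sum of residuals:", residual.sum())
print("residuals . fitted values:", np.dot(residual, regression_line))
```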

Now we plot residual vs fitted values:

plt.scatter(regression_line, residual)
plt.xlabel('Fitted Values (Log)')
plt.ylabel('Residual (Log)')
plt.title('Plot 5: Residual vs Fitted Values (Log-Log)')
plt.grid(True)

The same procedure applies in the original coordinates:

regression_curve = demo_exp_function(hours, *constants)


residual_original_data_points = sulfate - regression_curve

Then we plot the residuals against the fitted values (original coordinates):

plt.scatter(regression_curve, residual_original_data_points)
plt.xlabel('Fitted Values (Original)')
plt.ylabel('Residual (Original)')
plt.title('Plot 6: Residual vs Fitted Values (Original)')
plt.grid(True)

#Task 4: Use your plots to explain whether your regression is good or bad and why.

From plot 5: this plot shows the residuals of the log-log regression line we calculated
earlier. Whether a regression is good or bad can be judged from its residuals.

If the regression model is a good fit, we expect the residuals to be randomly scattered
around zero, indicating that the model captures the variation in the data well. In plot 5
the residuals are close to zero and evenly distributed, so this model is a very GOOD model.

From plot 6: this plot shows the residuals of the regression curve fitted to the original
data points. The same criterion applies: the residuals should be scattered randomly around zero.

In plot 6 the residuals are not close to zero, and they cluster where the fitted values lie
between about 4 and 6. This is not a good model; in fact it is pretty BAD at predicting
future values.
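
Both judgements rest on whether the residuals look randomly scattered around zero. One rough way to make that more concrete is to count how often consecutive residuals, ordered by time, change sign: a fit that leaves a systematic trend tends to produce long runs of same-signed residuals and therefore few sign changes. The helper count_sign_changes below is hypothetical and only an informal diagnostic, not a formal statistical test.

```python
import numpy as np
from scipy.stats import linregress


def count_sign_changes(residuals):
    """Count sign flips between consecutive non-zero residuals."""
    signs = np.sign(residuals)
    signs = signs[signs != 0]  # ignore exact zeros
    return int(np.sum(signs[1:] != signs[:-1]))


hours = np.array([2, 4, 6, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80,
                  90, 110, 130, 150, 160, 170, 180], dtype=float)
sulfate = np.array([15.11, 11.36, 9.77, 9.09, 8.48, 7.69, 7.33, 7.06,
                    6.7, 6.43, 6.16, 5.99, 5.77, 5.64, 5.39, 5.09, 4.87,
                    4.6, 4.5, 4.36, 4.27])

log_hours = np.log(hours)
log_sulfate = np.log(sulfate)
slope, intercept, _, _, _ = linregress(log_hours, log_sulfate)
residual = log_sulfate - (slope * log_hours + intercept)

# hours is already sorted, so the residuals are in time order.
changes_loglog = count_sign_changes(residual)
print("sign changes (log-log fit):", changes_loglog, "of", len(residual) - 1)
```

The same helper can be applied to residual_original_data_points to compare the two models on equal terms.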
