Notebook 4 - Machine Learning
May 17, 2024
Reference Guide for R (student resource) - Check out our reference guide for a full listing
of useful R commands for this project.
0.1 Data Science Project: Use data to determine the best and worst colleges
for conquering student debt.
0.1.1 Notebook 4: Machine Learning
Does college pay off? We’ll use some of the latest data from the US Department of Education’s
College Scorecard Database to answer that question.
In this notebook (the 4th of 4 total notebooks), you’ll use R to add polynomial terms to your
multiple regression models (i.e. polynomial regression). Then, you’ll use the principles of machine
learning to tune models for a prediction task on unseen data.
[2]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code.
# This command loads a useful package of R commands
library(coursekata)
0.1.2 The Dataset (four_year_colleges.csv)
General description - In this notebook, we'll be using the four_year_colleges.csv file, which
only includes schools that offer four-year bachelor's degrees and/or higher graduate degrees. Community
colleges and trade schools often have different goals (e.g. facilitating transfers, direct career
education) than institutions that offer four-year bachelor's degrees. By comparing four-year colleges
only to other four-year colleges, we'll have clearer analyses and conclusions.
This data is a subset of the US Department of Education’s College Scorecard Database. The data
is current as of the 2020-2021 school year.
Description of all variables: See here
Detailed data file description: See here
[4]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code.
# This command downloads data from the file 'four_year_colleges.csv' and stores it in an object called `dat`
dat <- read.csv('https://skewthescript.org/s/four_year_colleges.csv')
0.1.3 1.0 - Motivating non-linear regression
So far, we’ve focused entirely on linear regression and multiple linear regression models, which
use linear functions to relate predictors (e.g. net_tuition, grad_rate, pct_PELL) to the outcome
(default_rate).
In this notebook, we’re going to investigate ways to model non-linear relationships. To make this
task a bit more manageable at the start, let’s reduce the size of our dataset by taking a random
sample of 20 colleges from the dat dataframe. We will store our sample in a new R dataframe
called sample_dat.
[6]: ## Run this code but do not edit it
# create a dataset to train the model with 20 randomly selected observations
set.seed(2)
sample_dat <- sample(dat, size = 20)
Note: When getting a random sample, we’ll get different results each time we run our code because
it’s … well … random. This can be quite annoying. So, in the code above, we used the command
set.seed(2). This ensures that each time the code is executed, we get the same results for our
random sample - the results stored in seed 2. We could have also set the seed to 1 or 3 or 845 or
12345. The seed numbers serve merely as a unique ID that corresponds to a certain result from a
random draw. By setting a certain seed, we’ll always get a certain random draw.
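For example, here is a minimal sketch of how set.seed makes a random draw repeatable. It uses a small toy vector rather than our college data:

# A minimal sketch of reproducible sampling (toy vector, not the college data)
set.seed(2)
sample(1:100, size = 3)   # some specific draw of three numbers

set.seed(2)
sample(1:100, size = 3)   # re-setting the same seed repeats that exact draw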
1.1 - Let's take a look at our sample dataset. Print out the head and dim of sample_dat.
[7]: # Your code goes here
head(sample_dat)
A data.frame: 6 × 27 (only the first 5 columns are shown)
      OPEID   name                           city          state  region
      <int>   <chr>                          <chr>         <chr>  <chr>
975   379400  Saint Martin's University      Lacey         WA     Far West
710   316100  Northeastern State University  Tahlequah     OK     Rockies & Southw…
774   330400  Muhlenberg College             Allentown     PA     Northeast
416   231800  Spring Arbor University        Spring Arbor  MI     Midwest
392   223400  Adrian College                 Adrian        MI     Midwest
273   189200  University of Iowa             Iowa City     IA     Midwest
[8]: # Your code goes here
dim(sample_dat)
20 27
Check yourself: The dimensions of sample_dat should be 20 rows and 27 columns.
In prior notebooks, we focused on institutional and economic predictors of student loan default
rates. In this notebook, we’ll begin by analyzing an academic variable: SAT_avg. This variable
shows the average SAT score of students who matriculate to a college.
The following code creates a scatterplot of the relationship between SAT_avg (predictor) and
default_rate (outcome) from the dataset sample_dat:
[9]: ## Run this code but do not edit it
# create scatterplot: default_rate ~ SAT_avg
gf_point(default_rate ~ SAT_avg, data = sample_dat)
1.2 - Describe the direction of the relationship between SAT_avg and default_rate. Is it positive
or negative? Why do you think this is?
The relationship between SAT_avg and default_rate is negative. This is because as the average
SAT score of students matriculating to a college increases, the default rate on student loans tends
to decrease, suggesting that colleges with higher SAT averages have students who are better able
to pay back their loans.
1.3 - Create the same scatterplot as above, but with the simple linear model between default_rate
(outcome) and SAT_avg (predictor) overlayed on top.
Hint: Recall the gf_lm command from notebook 2.
[10]: # Your code goes here
gf_point(default_rate ~ SAT_avg, data = sample_dat) %>%
gf_lm()
1.4 - Would you say that this model provides a “good” fit for this dataset? Explain.
The model does not provide a particularly good fit, as indicated by the scatterplot and the likely
low R^2 value. The linear model may not capture all the nuances and variability in the relationship
between SAT_avg and default_rate.
1.5 - Use the lm command to fit the linear regression model, where we use SAT_avg (predictor) to
predict default_rate (outcome) in the dataset sample_dat. Store the model in a variable named
sat_model_1 and use the summary command to print out information about the model fit.
[11]: # Your code goes here
sat_model_1 <- lm(default_rate ~ SAT_avg, data = sample_dat)
summary(sat_model_1)
Call:
lm(formula = default_rate ~ SAT_avg, data = sample_dat)
Residuals:
Min 1Q Median 3Q Max
-3.2133 -1.2490 -0.0765 0.6196 4.9881
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.480574 3.989422 4.883 0.00012 ***
SAT_avg -0.013315 0.003421 -3.893 0.00107 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.156 on 18 degrees of freedom
Multiple R-squared: 0.4571, Adjusted R-squared: 0.4269
F-statistic: 15.15 on 1 and 18 DF, p-value: 0.001067
Check yourself: The 𝑅2 value shown in the model summary should be 0.4571
1.6 - Does the model’s 𝑅2 value indicate that this model provides a strong fit for this dataset?
Explain.
The R^2 value of 0.4571 indicates a moderate fit, explaining about 45.71% of the variance in
default_rate based on SAT_avg. This suggests that while SAT_avg is a significant predictor,
there are other factors influencing default_rate that are not captured by this model.
1.7 - If this model were curved, rather than linear, do you believe the 𝑅2 could be higher? Explain.
Yes, if the model were curved (i.e., non-linear), the R^2 could be higher. A polynomial regression
might better capture the relationship between SAT_avg and default_rate, potentially leading to
a higher R^2 value.
0.1.4 2.0 - Polynomial regression
Recall that simple linear regression follows this formula:
$$\hat{y} = \beta_0 + \beta_1 x$$
Where:
• $\beta_0$ is the intercept
• $\beta_1$ is the slope (coefficient of $x$)
• $\hat{y}$ is the predicted default_rate
• $x$ is the value of SAT_avg
If we want to capture the curvature in a scatter plot by creating a non-linear model, we can use
a technique called polynomial regression. For example, we could use a degree 2 polynomial
(quadratic), which looks like this:
$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2$$
Where:
• $\beta_0$ is the intercept
• $\beta_1$ is the coefficient of $x$ (linear term)
• $\beta_2$ is the coefficient of $x^2$ (squared term)
• $\hat{y}$ is the predicted default_rate
• $x$ is the SAT_avg
Below, we visualize the fit of this degree-2 polynomial (quadratic) model between SAT_avg and
default_rate:
[12]: ## Run this code but do not edit it
# create scatterplot: default_rate ~ SAT_avg, with degree 2 polynomial model overlayed
gf_point(default_rate ~ SAT_avg, data = sample_dat) %>%
  gf_lm(formula = y ~ poly(x, 2), color = "orange")
2.1 - Make a prediction: Will this polynomial regression model have a higher or lower 𝑅2 value
than the linear regression model? Justify your reasoning.
The polynomial regression model is likely to have a higher R^2 value than the linear regression
model because it can capture the curvature in the relationship between SAT_avg and default_rate,
providing a better fit to the data.
Let’s test your prediction. To do so, we’ll first need to fit the polynomial model. We can fit a
degree 2 polynomial to the data using the poly() function inside of the lm() function. Run the
cell below to see how it’s done.
[13]: ## Run this code but do not edit it
# degree 2 polynomial model for default_rate ~ SAT_avg
sat_model_2 <- lm(default_rate ~ poly(SAT_avg, 2), data = sample_dat)
sat_model_2
Call:
lm(formula = default_rate ~ poly(SAT_avg, 2), data = sample_dat)
Coefficients:
(Intercept) poly(SAT_avg, 2)1 poly(SAT_avg, 2)2
4.065 -8.391 4.355
The equation for this model would be
$$\hat{y} = 4.065 - 8.391x + 4.355x^2$$
Where:
• $\beta_0 = 4.065$ is the intercept
• $\beta_1 = -8.391$ is the coefficient of $x$, the linear term
• $\beta_2 = 4.355$ is the coefficient of $x^2$, the squared term
• $\hat{y}$ is the predicted default_rate
• $x$ is the SAT_avg
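One technical caveat: R's poly() builds orthogonal polynomial terms by default, so the coefficients printed above apply to transformed versions of SAT_avg rather than to raw SAT scores. If you wanted coefficients on the raw scores, a sketch like the one below gives an equivalent fit with a different parameterization (the name sat_model_2_raw is introduced here just for illustration):

# Sketch: the same quadratic fit with raw (non-orthogonal) polynomial terms,
# so the coefficients apply directly to SAT_avg and SAT_avg^2
sat_model_2_raw <- lm(default_rate ~ poly(SAT_avg, 2, raw = TRUE), data = sample_dat)
sat_model_2_raw
# An equivalent way to write the same model:
# lm(default_rate ~ SAT_avg + I(SAT_avg^2), data = sample_dat)

Either version produces the same fitted curve and the same R^2; only the way the coefficients are expressed differs.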
2.2 - Use the summary command on sat_model_2 to see summary information about the quadratic
model.
[14]: # Your code goes here
sat_model_2 <- lm(default_rate ~ poly(SAT_avg, 2), data = sample_dat)
summary(sat_model_2)
Call:
lm(formula = default_rate ~ poly(SAT_avg, 2), data = sample_dat)
Residuals:
Min 1Q Median 3Q Max
-3.6183 -0.9604 0.1192 0.9562 3.9014
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0650 0.4361 9.321 4.3e-08 ***
poly(SAT_avg, 2)1 -8.3909 1.9504 -4.302 0.000483 ***
poly(SAT_avg, 2)2 4.3553 1.9504 2.233 0.039280 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.95 on 17 degrees of freedom
Multiple R-squared: 0.5802, Adjusted R-squared: 0.5308
F-statistic: 11.75 on 2 and 17 DF, p-value: 0.0006251
Check yourself: The 𝑅2 value shown in the model summary should be 0.5802
2.3 - How does this model’s 𝑅2 value compare to that of the linear model? Was your prediction
right? Explain.
The R^2 value for the quadratic model is 0.5802, which is higher than the R^2 value of the linear
model (0.4571). This confirms that the polynomial regression model provides a better fit for the
data.
This analysis raises a natural question: Why stop at degree 2? By raising the degree, we can add
more curves to our model, potentially better fitting the data! Let’s visualize what happens when
we increase the degree in our polynomial regression models.
Degree 3 Polynomial Model
$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$$
[15]: ## Run this code but do not edit it
# create scatterplot: default_rate ~ SAT_avg, with degree 3 polynomial model overlayed
gf_point(default_rate ~ SAT_avg, data = sample_dat) %>%
  gf_lm(formula = y ~ poly(x, 3), color = "orange") + ylim(-4, 12)
Degree 5 Polynomial Model
$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5$$
[16]: ## Run this code but do not edit it
# create scatterplot: default_rate ~ SAT_avg, with degree 5 polynomial model overlayed
gf_point(default_rate ~ SAT_avg, data = sample_dat) %>%
  gf_lm(formula = y ~ poly(x, 5), color = "orange") + ylim(-4, 12)
Degree 12 Polynomial Model
$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5 + \beta_6 x^6 + \dots + \beta_{12} x^{12}$$
Note: The following code was pre-run, to save computing resources.
[17]: ## Note: This code was pre-run, to save computing resources
# create scatterplot: default_rate ~ SAT_avg, with degree 12 polynomial model overlayed
# gf_point(default_rate ~ SAT_avg, data = sample_dat) %>%
#   gf_smooth(method = "lm", formula = y ~ poly(x, 12), color = "orange") + ylim(-4, 14)
2.4 - Examine each plot for the polynomial models with degrees 3, 5, and 12. Which model do you think
would have the largest 𝑅2 value? Why?
The degree 12 model would likely have the largest R^2 value because it is the most flexible and
can fit the data points more closely, capturing more of the variance in default_rate.
To determine which polynomial model fits the data the best, we will fit models for each degree (3,
5, 12).
[18]: ## Run this code but do not edit it
# degree 3, 5, and 12 polynomial models for default_rate ~ SAT_avg
sat_model_3 <- lm(default_rate ~ poly(SAT_avg, 3), data = sample_dat)
sat_model_5 <- lm(default_rate ~ poly(SAT_avg, 5), data = sample_dat)
sat_model_12 <- lm(default_rate ~ poly(SAT_avg, 12), data = sample_dat)
Now we can compare each model’s 𝑅2 value. Normally, we use the summary command and read the
𝑅2 value. However, since we’ve fit so many models, we don’t want to print out the entire summary
for each one.
Instead, we’ll use commands like this: summary(sat_model_1)$r.squared. The $ operator is used
to extract just the r.squared element from the full summary. We execute this command for each
model, then print the results for ease of comparison.
[19]: ## Run this code but do not edit it
# r-squared value for each model
r2_sat_model_1 <- summary(sat_model_1)$r.squared
r2_sat_model_2 <- summary(sat_model_2)$r.squared
r2_sat_model_3 <- summary(sat_model_3)$r.squared
r2_sat_model_5 <- summary(sat_model_5)$r.squared
r2_sat_model_12 <- summary(sat_model_12)$r.squared
# print each model's r-squared value
print(paste("The R squared value for the degree 1 model is", r2_sat_model_1))
print(paste("The R squared value for the degree 2 model is", r2_sat_model_2))
print(paste("The R squared value for the degree 3 model is", r2_sat_model_3))
print(paste("The R squared value for the degree 5 model is", r2_sat_model_5))
print(paste("The R squared value for the degree 12 model is", r2_sat_model_12))
[1] "The R squared value for the degree 1 model is 0.457055427196517"
[1] "The R squared value for the degree 2 model is 0.580193597490376"
[1] "The R squared value for the degree 3 model is 0.60314009577391"
[1] "The R squared value for the degree 5 model is 0.647445002110733"
[1] "The R squared value for the degree 12 model is 0.775449820710613"
Check yourself: The 𝑅2 for the degree 5 model should be about 0.647
2.5 - The degree 12 model has the highest 𝑅2 value. Does that mean it’s the “best” model? Why
or why not?
Hint: Think about which model would do the best for predicting the rest of the data from the
original full dataset.
The degree 12 model having the highest R^2 value does not necessarily mean it’s the “best” model.
While it fits the training data very well, it may overfit, meaning it captures noise along with the
underlying pattern, potentially performing poorly on new, unseen data.
0.1.5 3.0 - Prediction, model tuning, & machine learning
In prior notebooks, we've used our models to make inferences about default rates. However, sometimes
in data science, we care more about predictions than we do about inferences. In particular,
many data science tasks require making accurate predictions on new data - data that hadn't yet
been collected when we first fit the model. This process of building models to predict new data,
especially when it's automated, is called machine learning.
The key to machine learning is building models that make accurate predictions on test data -
unseen data that weren’t used when fitting the model. Let’s see how this works. First, let’s create
a test dataset of 10 randomly sampled colleges. Importantly, these are colleges that our models
didn’t see while fitting:
[20]: ## Run this code but do not edit it
# create a data set to test the model with 10 new, randomly selected observations
# not used to train the model
set.seed(23)
test_dat <- sample(dat, size = 10)
3.1 - Use the head command on the test_dat data set.
[21]: # Your code goes here
head(test_dat)
A data.frame: 6 × 27 (only the first 5 columns are shown; some values truncated)
       OPEID    name                                    city             state  region
       <int>    <chr>                                   <chr>            <chr>  <chr>
925    364600   Texas Woman's University                Denton           TX     Roc…
284    1025600  Benedictine College                     Atchison         KS     Mid…
456    244900   Avila University                        Kansas City      MO     Mid…
615    291400   Catawba College                         Salisbury        NC     Sou…
913    363200   Texas A & M University-College Station  College Station  TX     Roc…
1015   2136600  Wisconsin Lutheran College              Milwaukee        WI     Mid…
The following code visualizes the new test data alongside the training data (the data we used to
originally fit our models).
[22]: ## Run this code but do not edit it
# label train and test sets
sample_dat$phase <- "train"
test_dat$phase <- "test"
# concatenate the two datasets
full_dat <- rbind(sample_dat, test_dat)
# create scatterplot: default_rate ~ SAT_avg, colored and shaped by phase (train vs. test)
gf_point(default_rate ~ SAT_avg, data = full_dat, color = ~phase, shape = ~phase)
Warning message:
“The `scale_name` argument of `discrete_scale()` is deprecated as of ggplot2 3.5.0.”
3.2 - Of all the polynomial models we fit before, which do you think would do best in predicting
the default rates in the test dataset?
Note: Use your gut and intuition here. No calculations required.
The degree 5 polynomial model might do the best in predicting default rates in the test dataset,
as it balances flexibility and the risk of overfitting better than the higher-degree models.
Let's see how good one of our models is at predicting default rates. The R code in the next cell
uses the predict function to make predictions on the test dataset. In this case, the output shows the
predicted default rates for the 10 test set colleges, as predicted by our degree 5 model.
[23]: ## Run this code but do not edit it
# get predictions for degree 5 model
pred_deg5 <- predict(sat_model_5, newdata = data.frame(SAT_avg = test_dat$SAT_avg))
pred_deg5
 1  4.71195777211723
 2  2.80117510935355
 3  4.76610631136996
 4  5.83135112143679
 5  1.1525539865398
 6  3.41638983019025
 7  9.85190742866518
 8  18.1060265766056
 9  1.29744229099948
10  4.28267983941371
So, how can we interpret these values? Well, the last college in our test set is University of New
Orleans, which has an SAT_avg value of 1088 and a default_rate of 6.6. It’s shown here on the
graph, alongside our degree 5 model.
Note: The following code is pre-run.
[ ]: ## Run this code but do not edit it (pre-run)
# create scatterplot: default_rate ~ SAT_avg, with degree 5 polynomial model overlayed
# gf_point(default_rate ~ SAT_avg, data = sample_dat, color = ~phase, shape = ~phase) %>%
#   gf_lm(formula = y ~ poly(x, 5), color = "orange") %>%
#   gf_point(default_rate ~ SAT_avg, data = full_dat, color = ~phase, shape = ~phase) + ylim(0, 19)
Our degree 5 model's predicted default rate for this college was about 4.28. That means that
our degree 5 model under-estimates the actual default rate by
$$6.6 - 4.28 = 2.32$$
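In R, that error could be computed directly. Here is a minimal sketch, assuming (as the output above suggests) that University of New Orleans is the 10th college in test_dat:

# Sketch: prediction error (residual) for the 10th test college
# (assumes University of New Orleans is row 10 of test_dat)
actual <- test_dat$default_rate[10]   # 6.6
predicted <- pred_deg5[10]            # about 4.28
actual - predicted                    # about 2.32: the model under-estimates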
The model's prediction and error are visualized in the plot below:
So, its predicted default rate is "off" by only about 2 percentage points! This is pretty amazing,
considering the model had only 20 training data values and the University of New Orleans wasn't
included among them. This is the power of machine learning: predicting previously unseen data!
This is just one prediction. We're really interested in how this model performed across all its
predictions. For that, let's measure its 𝑅2 (prediction strength) on the test set!
We can use the cor function to correlate the predictions with the actual default rates (𝑟) and then
square that value to get 𝑅2 , which gets us the prediction strength!
[24]: ## Run this code but do not edit it
# Get correlation between predicted and actual default rates in test set
cor(test_dat$default_rate, pred_deg5) ^ 2
0.616546630372038
We can now repeat this same process for all polynomial degrees.
[25]: ## Run this code but do not edit it
# Storing test set predictions for all models
pred_deg1 <- predict(sat_model_1, newdata = data.frame(SAT_avg = test_dat$SAT_avg))
pred_deg2 <- predict(sat_model_2, newdata = data.frame(SAT_avg = test_dat$SAT_avg))
pred_deg3 <- predict(sat_model_3, newdata = data.frame(SAT_avg = test_dat$SAT_avg))
pred_deg5 <- predict(sat_model_5, newdata = data.frame(SAT_avg = test_dat$SAT_avg))
pred_deg12 <- predict(sat_model_12, newdata = data.frame(SAT_avg = test_dat$SAT_avg))

# print each model's test-set r-squared value
print(paste("The test R squared value for the degree 1 model is", cor(test_dat$default_rate, pred_deg1) ^ 2))
print(paste("The test R squared value for the degree 2 model is", cor(test_dat$default_rate, pred_deg2) ^ 2))
print(paste("The test R squared value for the degree 3 model is", cor(test_dat$default_rate, pred_deg3) ^ 2))
print(paste("The test R squared value for the degree 5 model is", cor(test_dat$default_rate, pred_deg5) ^ 2))
print(paste("The test R squared value for the degree 12 model is", cor(test_dat$default_rate, pred_deg12) ^ 2))
[1] "The test R squared value for the degree 1 model is 0.55824446697698"
[1] "The test R squared value for the degree 2 model is 0.70025122602337"
[1] "The test R squared value for the degree 3 model is 0.733729012084851"
[1] "The test R squared value for the degree 5 model is 0.616546630372038"
[1] "The test R squared value for the degree 12 model is 0.176121561427524"
Check yourself: The 𝑅2 for the degree 5 model should be about 0.6165
3.3 - Compare the 𝑅2 estimates for each model. Which models did well? Which models did poorly?
Why do you think this is?
The moderate-degree models (degrees 2 and 3) performed best because they balance fitting the training
data and generalizing to new data. The degree 12 model performed poorly on the test data because it
overfit the training sample, and the degree 1 model trailed slightly because it could not capture the
curvature in the relationship.
3.4 - In machine learning, the central goal is to build our models so as to avoid “underfitting” and
“overfitting” our models to the training data. What do you think these terms mean? Which of our
models were underfit? Which do you think were overfit? Explain.
Underfitting occurs when a model is too simple to capture the underlying pattern in the data,
while overfitting occurs when a model is too complex and captures noise along with the underlying
pattern. The linear model likely underfit the data, while the degree 12 polynomial model likely
overfit the data.
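To make that comparison concrete, here is a small sketch that lines up training 𝑅2 (from section 2) with test 𝑅2 (from above) for each degree, reusing objects already created in this notebook:

# Sketch: compare training vs. test R^2 for each polynomial degree
# (reuses the r2_sat_model_* and pred_deg* objects created earlier)
test_r2 <- c(cor(test_dat$default_rate, pred_deg1)^2,
             cor(test_dat$default_rate, pred_deg2)^2,
             cor(test_dat$default_rate, pred_deg3)^2,
             cor(test_dat$default_rate, pred_deg5)^2,
             cor(test_dat$default_rate, pred_deg12)^2)
train_r2 <- c(r2_sat_model_1, r2_sat_model_2, r2_sat_model_3, r2_sat_model_5, r2_sat_model_12)
data.frame(degree = c(1, 2, 3, 5, 12), train_r2, test_r2)
# A large gap between train_r2 and test_r2 (as with degree 12) signals overfitting;
# low values for both (closer to degree 1) suggest underfitting.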
Recall that we built our polynomial models here with just one predictor: 𝑥 (SAT_avg). Yet, those
models could end up being quite complex…
$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$$
Now, imagine that we wanted to bring in multiple predictors ($x_1$ = SAT_avg, $x_2$ = net_tuition,
$x_3$ = grad_rate) for multiple regression. Plus, imagine that we decided to add in some polynomial
terms for each of these predictors. We could end up with a model that looks even more complicated,
with dozens or even hundreds of terms…
$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \beta_4 x_2 + \beta_5 x_2^2 + \beta_6 x_2^3 + \beta_7 x_3 + \beta_8 x_3^2 + \dots$$
3.5 - Is it always good to add more predictors and add more polynomial terms to your model?
Explain why or why not.
No, adding more predictors and polynomial terms is not always good because it can lead to overfit-
ting, where the model performs well on training data but poorly on new data. The goal is to find a
balance where the model is complex enough to capture the underlying pattern but not so complex
that it captures noise.
0.1.6 4.0 - In-class prediction competition
Now you have all the tools you need to build very powerful prediction models! This means that
it’s time for a friendly competition :)
The code below takes the full dataset and splits it into larger train and test datasets. 80% of the
colleges will go into the train dataset. 20% will go into the test dataset. Because we all are setting
the same seed (2024), everyone will get the exact same train and test sets:
[26]: ## Run but do not edit this code
# set training data to be 80% of all colleges
train_size <- floor(0.8 * nrow(dat))
## sample row indices
set.seed(2024)
train_ind <- sample(seq_len(nrow(dat)), size = train_size)
train <- dat[train_ind, ]
test <- dat[-train_ind, ]
[27]: dim(train)
842 26

[28]: dim(test)
211 26
Now it’s time to compete!
Goal: Create the most accurate prediction model of colleges’ default rates.
Evaluation: Whichever student has the highest 𝑅2 on the test set wins.
Guidelines: Save your best model as an object called my_model. You are only allowed to fit models
on the train set (not on the test set). You may use as many predictors and as many polynomial
terms as you'd like. Just be warned: don't fall into the trap of overfitting! Choose only the most
important variables and keep your models simple, so that you can generalize well to the test set.
Periodically test your model on the test set and then make adjustments as necessary, as sketched below.
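As a rough sketch of that fit-then-check workflow, you might fit a couple of candidate models on train and compare their test-set 𝑅2 values. The predictor combinations below are placeholders, not recommendations:

# Sketch of the tune-and-check workflow (placeholder predictors, not recommendations)
candidate_1 <- lm(default_rate ~ grad_rate, data = train)
candidate_2 <- lm(default_rate ~ grad_rate + poly(SAT_avg, 2), data = train)

# helper: test-set R^2 for any fitted model
r2_on_test <- function(model) {
  preds <- predict(model, newdata = test)
  cor(test$default_rate, preds) ^ 2
}

r2_on_test(candidate_1)
r2_on_test(candidate_2)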
Go!
[37]: ## Your code goes here
# Replace the predictors below with your own combo of variables and poly terms
my_model <- lm(default_rate ~ net_tuition + poly(SAT_avg, 2), data = train)

[38]: # run this code to get the R^2 value on the test set from your model
test_predictions = predict(my_model, newdata = test)
print(paste("The test R^2 value was: ", cor(test$default_rate, test_predictions) ^ 2))
[1] "The test R^2 value was: 0.552628396000714"
0.1.7 5.0 - NATIONWIDE prediction competition
Competition: We’re hosting a nationwide competition to see which student can build the best
model for predicting student loan default rates at different colleges. Here’s an article about last
year’s winners.
Evaluation: Across the country, all students are using the same train and test sets as you did in
the prior exercise to fit and evaluate their models. Your goal: Build a model that gives the best
predictions on this test set. The students whose models produce the highest 𝑅2 values on the test
set will be announced as champions!
Submission Process (due June 7, 2024 at 11:59pm CT):
1. Print and have a parent/guardian sign the media release form. This form gives permission to feature you and publish your results, in the event that you're a finalist! Take a picture or scan the signed form and submit it as part of Step #2 (below).
2. Submit this google form (note: you'll have to log into a google account), which allows you to upload your media release form, model, and notebook. This counts as your final submission.
Notes to avoid disqualification:
• Do not change the seed (2024) in the code block that splits the data into the train and test sets. Using the common seed of 2024 will ensure everyone across the country has the exact same train/test split.
• Make sure your model is fit using the train data. In other words, it should look like: my_model <- lm(default_rate ~ ..., data = train).
• Make sure you find the 𝑅2 value on the test data, using the provided code.
• There are ways to "cheat" on this competition by looking directly at the test set data values and designing your model to predict those values exactly (or approximately). However, based on the design of your model (which we'll see when you share your notebook), it's pretty easy for us to tell if you've done this. So, don't do it! Your submission will be discarded.
0.1.8 Feedback (Required)
Please take 2 minutes to fill out this anonymous notebook feedback form, so we can continue
improving this notebook for future years!