Introducing Grid Search
Hyperparameter Tuning in Python
Alex Scriven, Data Scientist
Automating 2 Hyperparameters
Your previous work:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

neighbors_list = [3, 5, 10, 20, 50, 75]
accuracy_list = []
for test_number in neighbors_list:
    model = KNeighborsClassifier(n_neighbors=test_number)
    predictions = model.fit(X_train, y_train).predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    accuracy_list.append(accuracy)
Which we then collated in a DataFrame to analyse.
Automating 2 Hyperparameters
What about testing values of 2 hyperparameters?
Using a GBM algorithm:
learn_rate: [0.001, 0.01, 0.05]
max_depth: [4, 6, 8, 10]
We could use a (nested) for loop!
Automating 2 Hyperparameters
Firstly a model creation function:
from sklearn.ensemble import GradientBoostingClassifier

def gbm_grid_search(learn_rate, max_depth):
    model = GradientBoostingClassifier(
        learning_rate=learn_rate,
        max_depth=max_depth)
    predictions = model.fit(X_train, y_train).predict(X_test)
    return [learn_rate, max_depth, accuracy_score(y_test, predictions)]
Automating 2 Hyperparameters
Now we can loop through our lists of hyperparameters and call our function:
learn_rate_list = [0.001, 0.01, 0.05]
max_depth_list = [4, 6, 8, 10]

results_list = []
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        results_list.append(gbm_grid_search(learn_rate, max_depth))
Automating 2 Hyperparameters
We can put these results into a DataFrame as well and print out:
results_df = pd.DataFrame(results_list, columns=['learning_rate', 'max_depth', 'accuracy'])
print(results_df)
How many models?
Adding more hyperparameters and values quickly builds many more models.
The growth is not linear, it is exponential: one more value of a hyperparameter is not just one more model.
5 values for hyperparameter 1 and 10 values for hyperparameter 2 means 50 models!
What about cross-validation?
10-fold cross-validation would make 50 x 10 = 500 models! (See the quick check below.)
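As a quick sanity check of that arithmetic (an illustrative sketch; the variable names here are made up), the model count is simply the product of the value counts and the number of folds:
n_values_hp1 = 5    # values tried for hyperparameter 1
n_values_hp2 = 10   # values tried for hyperparameter 2
n_folds = 10        # 10-fold cross-validation

# Every combination of values is one grid square, fit once per fold
print(n_values_hp1 * n_values_hp2 * n_folds)  # 500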
From 2 to N hyperparameters
What about adding more hyperparameters?
We could nest our loop!
# Adjust the list of values to test
learn_rate_list = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5]
max_depth_list = [4, 6, 8, 10, 12, 15, 20, 25, 30]
subsample_list = [0.4, 0.6, 0.7, 0.8, 0.9]
max_features_list = ['auto', 'sqrt']
From 2 to N hyperparameters
Adjust our function:
def gbm_grid_search(learn_rate, max_depth, subsample, max_features):
    model = GradientBoostingClassifier(
        learning_rate=learn_rate,
        max_depth=max_depth,
        subsample=subsample,
        max_features=max_features)
    predictions = model.fit(X_train, y_train).predict(X_test)
    return [learn_rate, max_depth, subsample, max_features,
            accuracy_score(y_test, predictions)]
From 2 to N hyperparameters
Adjusting our for loop (nesting):
results_list = []
for learn_rate in learn_rate_list:
    for max_depth in max_depth_list:
        for subsample in subsample_list:
            for max_features in max_features_list:
                results_list.append(gbm_grid_search(learn_rate, max_depth,
                                                    subsample, max_features))

results_df = pd.DataFrame(results_list, columns=['learning_rate',
    'max_depth', 'subsample', 'max_features', 'accuracy'])
print(results_df)
From 2 to N hyperparameters
How many models now?
7 x 9 x 5 x 2 = 630 models (6,300 with 10-fold cross-validation!)
We can't keep nesting forever!
Plus, what if we wanted:
Details on training times & scores
Details on cross-validation scores
Introducing Grid Search
Let's create a grid:
Down the left, all the values of max_depth
Across the top, all the values of learning_rate (a small sketch of this layout follows below)
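As an illustrative sketch (not part of the original example), the layout can be built with pandas, max_depth down the rows and learning_rate across the columns:
import pandas as pd

learn_rate_list = [0.001, 0.01, 0.05]
max_depth_list = [4, 6, 8, 10]

# Each cell is one (max_depth, learning_rate) combination to try
grid = pd.DataFrame([[(depth, lr) for lr in learn_rate_list] for depth in max_depth_list],
                    index=max_depth_list, columns=learn_rate_list)
print(grid)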
Introducing Grid Search
Working through each cell on the grid:
(4,0.001) is equivalent to making an estimator like so:
GradientBoostingClassifier(max_depth=4, learning_rate=0.001)
Grid Search Pros & Cons
Some advantages of this approach:
You don't have to write thousands of lines of code
Finds the best model within the grid (note: only the best among the values you specified!)
Easy to explain
Grid Search Pros & Cons
Some disadvantages of this approach:
Computationally expensive! Remember how quickly we made 6,000+ models?
It is 'uninformed': the results of one model don't help build the next one.
We will cover 'informed' methods later!
Let's practice!
Grid Search with Scikit Learn
GridSearchCV Object
Introducing a GridSearchCV object:
# Signature from an older scikit-learn release; iid and fit_params have since been removed
sklearn.model_selection.GridSearchCV(
    estimator, param_grid, scoring=None, fit_params=None,
    n_jobs=None, iid='warn', refit=True, cv='warn',
    verbose=0, pre_dispatch='2*n_jobs',
    error_score='raise-deprecating',
    return_train_score='warn')
Steps in a Grid Search
Steps in a Grid Search:
1. Choose an algorithm whose hyperparameters we will tune (sometimes called an 'estimator')
2. Define which hyperparameters we will tune
3. Define a range of values for each hyperparameter
4. Set a cross-validation scheme; and
5. Define a score function so we can decide which square on our grid was 'the best'
6. Include extra useful information or functions
GridSearchCV Object Inputs
The important inputs are:
estimator
param_grid
cv
scoring
refit
n_jobs
return_train_score
GridSearchCV 'estimator'
The estimator input:
Essentially our algorithm
You have already worked with KNN, Random Forest, GBM, Logistic Regression
Remember:
Only one estimator per GridSearchCV object
GridSearchCV 'param_grid'
The param_grid input:
Setting which hyperparameters and values to test
Rather than a list:
max_depth_list = [2, 4, 6, 8]
min_samples_leaf_list = [1, 2, 4, 6]
This would be:
param_grid = {'max_depth': [2, 4, 6, 8],
'min_samples_leaf': [1, 2, 4, 6]}
GridSearchCV 'param_grid'
The param_grid input:
Remember: The keys in your param_grid dictionary must be valid hyperparameters.
For example, for a Logistic regression estimator:
# Incorrect
param_grid = {'C': [0.1, 0.2, 0.5],
              'best_choice': [10, 20, 50]}
ValueError: Invalid parameter best_choice for estimator LogisticRegression
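A handy way to check which keys are valid (a small sketch, not from the original slides) is to list the estimator's own parameter names:
from sklearn.linear_model import LogisticRegression

# Every key in param_grid must appear in this list of tunable parameters
print(sorted(LogisticRegression().get_params().keys()))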
GridSearchCV 'cv'
The cv input:
Choice of how to undertake cross-validation
Passing an integer runs k-fold cross-validation, where 5 or 10 folds is standard (see the sketch below)
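For example (a sketch with an assumed estimator and grid; the variable names are illustrative), cv can be a plain integer or a splitter object for finer control:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# cv as an integer: plain 5-fold cross-validation
grid_cv_int = GridSearchCV(RandomForestClassifier(), {'max_depth': [2, 4]}, cv=5)

# cv as a splitter object: shuffled folds with a fixed random seed
grid_cv_kfold = GridSearchCV(RandomForestClassifier(), {'max_depth': [2, 4]},
                             cv=KFold(n_splits=5, shuffle=True, random_state=42))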
GridSearchCV 'scoring'
The scoring input:
Which score to use to choose the best grid square (model)
Use your own or Scikit Learn's metrics module
You can check all the built-in scoring functions this way:
from sklearn import metrics
sorted(metrics.SCORERS.keys())  # on newer scikit-learn, use metrics.get_scorer_names()
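To 'use your own' score, you can wrap any metric with make_scorer (an illustrative sketch; the choice of F-beta here is arbitrary):
from sklearn.metrics import fbeta_score, make_scorer

# Turn a plain metric function into a scorer GridSearchCV understands
f2_scorer = make_scorer(fbeta_score, beta=2)

# Then pass either a built-in name or the custom scorer:
#   GridSearchCV(..., scoring='accuracy')  or  GridSearchCV(..., scoring=f2_scorer)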
GridSearchCV 'refit'
The refit input:
Refits the estimator on the whole training data using the best hyperparameters found
Allows the GridSearchCV object to be used as an estimator (for prediction)
A very handy option!
GridSearchCV 'n_jobs'
The n_jobs input:
Assists with parallel execution
Allows multiple models to be created at the same time, rather than one after the other
Some handy code:
import os
print(os.cpu_count())
Careful using all your cores for modelling if you want to do other work!
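As a small related sketch (the estimator and grid here are placeholders), n_jobs=-1 asks scikit-learn to use every available core:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# n_jobs=-1 uses all cores; a positive integer caps the parallelism instead
grid_parallel = GridSearchCV(RandomForestClassifier(), {'max_depth': [2, 4]}, n_jobs=-1)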
GridSearchCV 'return_train_score'
The return_train_score input:
Logs statistics about the training runs that were undertaken
Useful for analyzing bias-variance trade-off but adds computational expense.
Does not assist in picking the best model, only for analysis purposes
Building a GridSearchCV object
Building our own GridSearchCV Object:
# Create the grid
param_grid = {'max_depth': [2, 4, 6, 8], 'min_samples_leaf': [1, 2, 4, 6]}
# Get a base classifier with some set parameters
# (note: newer scikit-learn versions use max_features='sqrt' in place of 'auto')
rf_class = RandomForestClassifier(criterion='entropy', max_features='auto')
Building a GridSearchCV Object
Putting the pieces together:
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring='accuracy',
    n_jobs=4,
    cv=10,
    refit=True,
    return_train_score=True)
Using a GridSearchCV Object
Because we set refit to True we can directly use the object:
# Fit the object to our data
grid_rf_class.fit(X_train, y_train)
# Make predictions
grid_rf_class.predict(X_test)
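Because the predictions come from the refit best model, they can be scored as usual (a sketch, assuming the same X_test and y_test split used above):
from sklearn.metrics import accuracy_score

# Evaluate the refit best model on the held-out test set
test_predictions = grid_rf_class.predict(X_test)
print(accuracy_score(y_test, test_predictions))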
Let's practice!
Understanding a grid search output
Analyzing the output
Let's analyze the GridSearchCV outputs.
There are three different groups of GridSearchCV properties:
A results log
cv_results_
The best results
best_score_ , best_params_ & best_index_
'Extra information'
scorer_ , n_splits_ & refit_time_
Accessing object properties
Properties are accessed using dot notation.
For example:
grid_search_object.property
Where property is the actual property you want to retrieve
The `.cv_results_` property
The cv_results_ property:
Read this into a DataFrame to print and analyze:
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
print(cv_results_df.shape)
(12, 23)
The 12 rows correspond to the 12 squares in our grid, that is, the 12 models we ran
The `.cv_results_` 'time' columns
The time columns record how long it took to fit and score the model, summarized across the cross-validation folds as means and standard deviations (mean_fit_time, std_fit_time, mean_score_time, std_score_time):
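The original slide showed these columns as a screenshot; a small sketch that selects them from the DataFrame built above:
# Pick out the timing columns from cv_results_
time_columns = [col for col in cv_results_df.columns if 'time' in col]
print(cv_results_df[time_columns])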
The `.cv_results_` 'param_' columns
The param_ columns store the hyperparameter values tested in that row, one column per hyperparameter:
The `.cv_results_` 'params' column
The params column contains a dictionary of all the parameters for that row:
pd.set_option("display.max_colwidth", None)  # None = no truncation (older pandas used -1)
print(cv_results_df.loc[:, "params"])
The `.cv_results_` 'test_score' columns
The test_score columns contain the scores on our test set for each of our cross-folds as well as some
summary statistics:
The `.cv_results_` 'rank_test_score' column
The rank column, ordering the mean_test_score from best to worst:
Extracting the best row
We can select the best grid square easily from cv_results_ using the rank_test_score column:
best_row = cv_results_df[cv_results_df["rank_test_score"] == 1]
print(best_row)
The `.cv_results_` 'train_score' columns
The test_score columns are then repeated for the training scores (the train_score columns).
Some important notes to keep in mind:
return_train_score must be True to include the training score columns.
There is no ranking column for the training scores, as we only care about test set performance (see the comparison sketch below).
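A small comparison sketch (assumes return_train_score=True, as set when building grid_rf_class):
# Large gaps between mean train and test scores hint at over-fitting
print(cv_results_df[['mean_train_score', 'mean_test_score']])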
The best grid square
Information on the best grid square is neatly summarized in the following three properties (printed in the sketch after this list):
best_params_ , the dictionary of hyperparameters that gave the best score.
best_score_ , the actual best score.
best_index_ , the index of the row in cv_results_ whose rank_test_score is 1.
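For example (a sketch using the fitted grid_rf_class and the cv_results_df built earlier):
print(grid_rf_class.best_params_)  # e.g. {'max_depth': ..., 'min_samples_leaf': ...}
print(grid_rf_class.best_score_)   # the best mean cross-validated test score

# best_index_ picks out the winning row of cv_results_
print(cv_results_df.loc[grid_rf_class.best_index_])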
The `best_estimator_` property
The best_estimator_ property is an estimator built using the best parameters from the grid search.
For us this is a Random Forest estimator:
type(grid_rf_class.best_estimator_)
sklearn.ensemble.forest.RandomForestClassifier
We could also use this object directly as an estimator if we want (see the sketch below)!
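For instance (a sketch, assuming the same X_test as before), its full classifier API is available:
# best_estimator_ is a fitted RandomForestClassifier, so we can call predict_proba on it
print(grid_rf_class.best_estimator_.predict_proba(X_test)[:5])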
The `best_estimator_` property
print(grid_rf_class.best_estimator_)
Extra information
Some extra information is available in the following properties (printed in the sketch after this list):
scorer_
Which scoring function was used on the held-out data (accuracy in our example).
n_splits_
How many cross-validation splits were used (we set cv=10).
refit_time_
The number of seconds taken to refit the best model on the whole dataset.
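These can be inspected directly (a sketch using the fitted grid_rf_class from earlier):
print(grid_rf_class.scorer_)      # the scoring function used on the held-out folds
print(grid_rf_class.n_splits_)    # number of cross-validation splits
print(grid_rf_class.refit_time_)  # seconds taken to refit the best model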
Let's practice!