
CalibratedClassifierCV doesn't interact properly with Pipeline estimators  #8710

Closed

Description

@stoddardg

Hi,

I'm trying to use CalibratedClassifierCV to calibrate the probabilities from a Gradient Boosted Tree model. The GBM is the final step of a Pipeline whose earlier stages convert the categorical features (using DictVectorizer) before the GBM is fit. The issue is that when I pass the fitted Pipeline to CalibratedClassifierCV as a prefit estimator, fitting fails as soon as I pass in the raw data. Here's a small example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.calibration import CalibratedClassifierCV, _CalibratedClassifier
from sklearn.pipeline import Pipeline

fake_features = [
    {'state':'NY','age':'adult'},
    {'state':'TX','age':'adult'},
    {'state':'VT','age':'child'}
]

labels = [1,0,1]

pipeline = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('clf', RandomForestClassifier())
])

pipeline.fit(fake_features, labels)

clf_isotonic = CalibratedClassifierCV(base_estimator=pipeline, cv='prefit', method='isotonic')
clf_isotonic.fit(fake_features, labels)

When running that, I get the following error on the last line:

TypeError: float() argument must be a string or a number, not 'dict'

On the other hand, if I replace the last two lines with the following, things work fine:

clf_isotonic = _CalibratedClassifier(base_estimator=pipeline, method='isotonic')
clf_isotonic.fit(fake_features, labels)
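
As a quick sanity check (this line isn't in the snippet above), the fitted pipeline itself is perfectly happy with the raw dicts, since DictVectorizer does the conversion inside the pipeline:

# Works: the pipeline's first step converts the dicts before the forest sees them
print(pipeline.predict_proba(fake_features))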

It seems that CalibratedClassifierCV checks whether the X data is valid before invoking anything on the base estimator (https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/calibration.py#L126). In my case this check seems slightly off, since it's the pipeline itself that converts the data into the proper form before feeding it to the estimator.

On the other hand, _CalibratedClassifier doesn't make this check first, so the code works (i.e. the data is fed into the pipeline, the model is fit, and then probabilities are calibrated appropriately).
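
I believe the same failure can be reproduced with the generic input validation alone, which is my guess at the root cause here (the exact wording of the error may differ between versions):

from sklearn.utils import check_X_y

# Roughly what CalibratedClassifierCV.fit does before it ever touches the
# base estimator: coerce X into a numeric array, which fails on the dicts.
check_X_y(fake_features, labels)
# TypeError: float() argument must be a string or a number, not 'dict'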

My use case (which is not reflected in the example) is to use the initial stages of the pipeline to select columns from a DataFrame, encode the categoricals, and then fit the model. I then pickle the fitted pipeline (after using GridSearchCV to select hyperparameters). Later on, I can load the pipeline and use it to predict on new data without having to remember what transformations the raw data needs. I now want to calibrate the model after fitting it, but ran into this problem.
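
For concreteness, here's a rough sketch of that workflow. The select_columns helper, the column names, and the parameter grid are all made up for illustration; my real pipeline is more involved:

import pickle

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical helper: pull the relevant columns out of a raw DataFrame
# and hand them to DictVectorizer as a list of dicts.
def select_columns(df):
    return df[['state', 'age']].to_dict(orient='records')

train_df = pd.DataFrame({
    'state': ['NY', 'TX', 'VT', 'NY', 'TX', 'VT'],
    'age': ['adult', 'adult', 'child', 'child', 'adult', 'child'],
    'ignored': [1, 2, 3, 4, 5, 6],  # extra column the pipeline should drop
})
train_labels = [1, 0, 1, 0, 1, 0]

full_pipeline = Pipeline([
    ('select', FunctionTransformer(select_columns, validate=False)),
    ('vectorizer', DictVectorizer(sparse=False)),
    ('clf', GradientBoostingClassifier())
])

# Pick hyperparameters, then pickle the fitted pipeline
search = GridSearchCV(full_pipeline, param_grid={'clf__n_estimators': [10, 50]}, cv=2)
search.fit(train_df, train_labels)
with open('model.pkl', 'wb') as f:
    pickle.dump(search.best_estimator_, f)

# Later: load the pipeline and score raw data directly,
# without redoing any of the transformations by hand
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
print(model.predict_proba(train_df))

# This is the step I now want, and it fails with the TypeError above:
# CalibratedClassifierCV(base_estimator=model, cv='prefit',
#                        method='isotonic').fit(train_df, train_labels)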

For reference, here's all my system info:

Linux-3.10.0-514.2.2.el7.x86_64-x86_64-with-redhat-7.3-Maipo
Python 3.6.0 |Continuum Analytics, Inc.| (default, Dec 23 2016, 12:22:00) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.12.0
SciPy 0.18.1
Scikit-Learn 0.18.1

Thanks for reading (and for all of your hard work on scikit-learn!).

Metadata

Labels

Easy (Well-defined and straightforward way to resolve), Enhancement, good first issue (Easy with clear instructions to resolve)
