CalibratedClassifierCV doesn't interact properly with Pipeline estimators #8710

Closed
stoddardg opened this issue Apr 6, 2017 · 27 comments · Fixed by #19555
Labels: Easy (well-defined and straightforward way to resolve), Enhancement, good first issue (easy with clear instructions to resolve)

Comments

stoddardg commented Apr 6, 2017

Hi,

I'm trying to use CalibratedClassifierCV to calibrate the probabilities from a Gradient Boosted Tree model. The GBM is wrapped in a Pipeline estimator, where the initial stages of the Pipeline convert categoricals (using DictVectorizer) before the GBM is fit. The issue is that when I then wrap the fitted pipeline in CalibratedClassifierCV as a prefit estimator, fitting fails as soon as I pass in the data. Here's a small example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.calibration import CalibratedClassifierCV, _CalibratedClassifier
from sklearn.pipeline import Pipeline

fake_features = [
    {'state': 'NY', 'age': 'adult'},
    {'state': 'TX', 'age': 'adult'},
    {'state': 'VT', 'age': 'child'},
]

labels = [1, 0, 1]

pipeline = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('clf', RandomForestClassifier()),
])

pipeline.fit(fake_features, labels)

clf_isotonic = CalibratedClassifierCV(base_estimator=pipeline, cv='prefit', method='isotonic')
clf_isotonic.fit(fake_features, labels)

When running that, I get the following error on the last line:

TypeError: float() argument must be a string or a number, not 'dict'

On the other hand, if I replace the last two lines with the following, things work fine:

clf_isotonic = _CalibratedClassifier(base_estimator=pipeline, method='isotonic')
clf_isotonic.fit(fake_features, labels)

It seems that CalibratedClassifierCV checks whether the X data is valid prior to invoking anything on the base estimator (https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/calibration.py#L126). In my case, this check seems slightly off, since I'm using the pipeline to convert the data into the proper form before feeding it into the estimator.

On the other hand, _CalibratedClassifier doesn't make this check first, so the code works (i.e. the data is fed into the pipeline, the model is fit, and then probabilities are calibrated appropriately).

My use case (which is not reflected in the example) is to use the initial stages of the pipeline to select columns from a dataframe, encode the categoricals, and then fit the model. I then pickle the fitted pipeline (after using GridSearchCV to select hyperparameters). Later on, I can load the model and use it to predict on new data, while abstracting away how the raw data needs to be transformed. I now want to calibrate the model after fitting it, but ran into this problem.
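
For concreteness, here is a minimal sketch of that workflow, reusing the toy data above (repeated so the cross-validation folds see both classes); the parameter grid is just a placeholder:

import pickle
from sklearn.model_selection import GridSearchCV

X, y = fake_features * 4, labels * 4  # enlarge the toy data for CV

# Hypothetical hyperparameter search over the whole pipeline
search = GridSearchCV(pipeline, {'clf__n_estimators': [10, 50]}, cv=3)
search.fit(X, y)

# Persist the fitted pipeline, then reload it later
fitted_pipeline = pickle.loads(pickle.dumps(search.best_estimator_))

# Calibrate the prefit pipeline (ideally on held-out data)
clf_isotonic = CalibratedClassifierCV(base_estimator=fitted_pipeline,
                                      cv='prefit', method='isotonic')
clf_isotonic.fit(X, y)  # raises the TypeError shown above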

For reference, here's all my system info:

Linux-3.10.0-514.2.2.el7.x86_64-x86_64-with-redhat-7.3-Maipo
Python 3.6.0 |Continuum Analytics, Inc.| (default, Dec 23 2016, 12:22:00) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.12.0
SciPy 0.18.1
Scikit-Learn 0.18.1

Thanks for reading (and for all of your hard work on scikit-learn!).

amueller (Member) commented Apr 6, 2017

Hm I get

ValueError: Got X with X.ndim=1. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

but I'm on master. Either way, I agree that we don't need to do input validation here; it should happen in the base estimator.

stoddardg (Author) commented Apr 6, 2017

I should have noted that in my actual use case, I get a different error:

ValueError: could not convert string to float:

which is caused by a column of strings in the DataFrame that I pass. In both cases, it is an error about the input data.

jnothman (Member) commented Apr 6, 2017 via email

amueller (Member) commented Apr 6, 2017

FYI, theoretically I would prefer make_pipeline(DictVectorizer(), CalibratedClassifierCV(RandomForestClassifier())), because you don't need to cross-validate the DictVectorizer part (though you're prefitting anyhow, I guess).
But that also doesn't work because of #6451 (at least if the pipeline clones). I just recently realized that that will become a problem with pipeline in the future...
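
For reference, a rough sketch of that preferred nesting (non-prefit; it reuses the toy data from the original post, repeated so each calibration fold sees both classes):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

# The DictVectorizer sits outside the calibration loop; only the
# classifier's probabilities are cross-validated and calibrated
pipe = make_pipeline(
    DictVectorizer(),
    CalibratedClassifierCV(RandomForestClassifier(), method='isotonic', cv=2),
)
pipe.fit(fake_features * 4, labels * 4)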

stoddardg (Author) commented:

So this wasn't in my example above, but the pipeline gets put into GridSearchCV, and then I want to calibrate the model chosen by GridSearchCV. How does GridSearchCV interact with CalibratedClassifierCV?

From my limited understanding, CalibratedClassifierCV produces K models, where each model is constructed from K-1 folds, and then averages the predictions from each of the K models to make a prediction. This seems semantically different from standard CV, where you select the model with the best performance over the K folds but then construct a single model using the entire training data. I'm not sure how to fit CalibratedClassifierCV into that procedure.
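
That K-model averaging is observable directly; a quick sketch on synthetic data:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)
clf = CalibratedClassifierCV(RandomForestClassifier(), cv=3)
clf.fit(X, y)

# One calibrated (classifier, calibrator) pair per fold;
# predict_proba averages their outputs
print(len(clf.calibrated_classifiers_))  # -> 3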

The thing that made sense to me with CalibratedClassifierCV was to prefit the model from GridSearchCV and then calibrate with a different set of data (as per the docs' recommendation).

amueller (Member) commented Apr 6, 2017

How does GridSearchCV interact with CalibratedClassifierCV?

Without prefit it's fine, with prefit it won't work right now :-/

You could just nest the GridSearchCV within the cross-validation of the CalibratedClassifierCV, though that would be a bit computationally expensive.
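
A sketch of that nesting on plain array data (with the dict features from the original post it would still hit the validation error this issue is about):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# The grid search itself is the base estimator, so each calibration
# fold re-runs the full search; hence the computational expense
search = GridSearchCV(RandomForestClassifier(),
                      {'max_depth': [3, 5, None]}, cv=3)
clf = CalibratedClassifierCV(base_estimator=search, method='sigmoid', cv=3)
clf.fit(X, y)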

baharian commented May 3, 2017

I am also interested in the status of the interplay between RandomizedSearchCV (or GridSearchCV) and CalibratedClassifierCV. I currently do hyper-parameter optimization with RandomizedSearchCV (which by default refits the best model on the entire dataset it was given) and then calibrate this model with CalibratedClassifierCV using the cv='prefit' argument.

connorbrinton pushed a commit to connorbrinton/scikit-learn that referenced this issue Jan 28, 2019
CalibratedClassifierCV now handles the calibration process in such a way
that probability estimates can be calibrated for multi-label targets.
Several methods of CalibratedClassifierCV and _CalibratedClassifier were
also cleaned up to support this new functionality. Changes include:
* Looser input validation on arguments passed to wrapped classifiers
  (fixes scikit-learn#8710)
* Target classes and type are determined before cross-validation, rather
  than on each fold individually
* Label predictions from CalibratedClassifierCV.predict are obtained
  using `LabelBinarizer.inverse_transform`, which supports multi-label
  predictions
* Specialized logic in _CalibratedClassifier for handling binary
  classification problems is more thoroughly commented
* Shape of uncalibrated estimates from wrapped classifier is checked
  against the expected shape in _CalibratedClassifier
* Simplification of logic in _CalibratedClassifier.predict_proba along
  with more comments explaining what's happening
* Tests for acceptance of 1D feature arrays as input and valid
  multi-label probability predictions
henningsway commented:

I'm currently running into the same issue.

Is this purely an input-validation issue, i.e. the encoding of the categorical data is never attempted because the error is thrown first? And is there a way to disable this input validation locally?

(I am also not familiar with _CalibratedClassifier and how to use it)

amueller (Member) commented Jun 5, 2019

Hm I'm a bit surprised by the outcome of #13077 but I guess using dicts wasn't anticipated. I only skimmed the discussion there but I feel like we should skip the input validation even more.

henningsway commented:

I currently use _CalibratedClassifier to calibrate a pipeline model (optimised by GridSearchCV) on some extra validation data, to work around the overly restrictive input validation of CalibratedClassifierCV.

Is there anything I lose with this approach compared to CalibratedClassifierCV with cv='prefit'?

Ideally, I would be able to calibrate my model within the Pipeline/GridSearchCV already, but I have trouble passing the parameters from the pipeline grid search to the calibrated model.

Others seem to have similar problems: https://stackoverflow.com/a/49833102/1392529
Any suggestions here?

jnothman (Member) commented Jun 18, 2019 via email

amueller added the "Easy (well-defined and straightforward way to resolve)" and "good first issue (easy with clear instructions to resolve)" labels on Jul 12, 2019
jsadloske commented:

I think they are asking how to handle this scenario:

from sklearn.datasets import make_moons
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.calibration import CalibratedClassifierCV

X, y = make_moons()

my_pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('model', CalibratedClassifierCV(RandomForestClassifier(n_estimators=10), cv=5))
])

param_grid = {
    'model__max_depth': list(range(2, 10))
}

search = RandomizedSearchCV(my_pipeline, param_grid, cv=3, n_iter=4, iid=False)
search.fit(X, y)

How do I modify param_grid so max_depth gets passed to RandomForestClassifier?

This is similar to how VotingClassifier works, but there the estimators parameter takes a list of (name, estimator) tuples, so you can give RandomForestClassifier a name and reference it in param_grid.

rf = RandomForestClassifier(n_estimators=10)

my_pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('model', VotingClassifier([('rf', rf)]))  # we can name it 'rf'
])

param_grid = {
    'model__rf__max_depth': list(range(2, 10))
}

search = RandomizedSearchCV(my_pipeline, param_grid, cv=3, n_iter=4, iid=False)
search.fit(X, y)
search.best_params_

amueller (Member) commented Jul 29, 2019

model__estimator__max_depth is the answer. Where in the docs should we add this to be easy to find?

jsadloske commented:

For anyone else reading this, I got it to work with model__base_estimator__max_depth as I guess the parameter is called base_estimator here.
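
In full, the working version of the earlier snippet looks like this (same my_pipeline as above):

param_grid = {
    'model__base_estimator__max_depth': list(range(2, 10))
}

search = RandomizedSearchCV(my_pipeline, param_grid, cv=3, n_iter=4, iid=False)
search.fit(X, y)
search.best_params_  # e.g. {'model__base_estimator__max_depth': ...}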

It doesn't seem to work with fit_params though.

import numpy as np  # needed for the random sample weights

fit_params = {
    'model__base_estimator__sample_weight': np.random.random(size=X.shape[0])
}

search = RandomizedSearchCV(my_pipeline, param_grid, cv=3, n_iter=4, iid=False)
search.fit(X, y, **fit_params)

TypeError: fit() got an unexpected keyword argument 'base_estimator__sample_weight'

I guess it would go in this section https://scikit-learn.org/stable/modules/compose.html#nested-parameters

amueller (Member) commented Aug 1, 2019

CalibratedClassifierCV doesn't take arbitrary fit_params, but its fit does accept sample_weight and passes it through correctly, so

fit_params = {
    'model__sample_weight': np.random.random(size=X.shape[0])
}

search = RandomizedSearchCV(my_pipeline, param_grid, cv=3, n_iter=4, iid=False)
search.fit(X, y, **fit_params)

works.
Yes, that's inconsistent, and we're looking for a fix.
It's a bit tricky to figure out where in the process you want to apply sample weights and where not; see #4497.
Basically, you could use the sample weights for fitting both the forest and the calibration, for just the calibration, or for just the forest, and we haven't decided on a syntax to choose between these options.
Right now, CalibratedClassifierCV only lets you weight both, not each separately. What we do with the parameters is slightly different, because they only ever apply to a single estimator. But you might have a long pipeline and use sample weights in each step, while (maybe) not wanting to specify fit_params for each step in the pipeline.
Also see scikit-learn/enhancement_proposals#16

amueller (Member) commented Aug 1, 2019

Hm, actually I think it should go into the GridSearchCV documentation. You pointed towards the pipeline documentation, but it's actually somewhat unrelated to pipelines.
One could also argue it's unrelated to GridSearchCV and just concerns set_params, but I think GridSearchCV is where it's most likely to matter.

amueller (Member) commented Aug 1, 2019

I'm pretty sure this issue now contains (at least) three completely separate issues and we should probably separate them.

amueller (Member) commented Aug 1, 2019

@jsadloske I tried to add an example here based on your use-case: #14548

The inconsistencies between fit_params and set_params will probably take a bit longer to fix.

ftrojan commented Dec 24, 2020

I tried the code example in the original post on my Mac with Python 3.7.3, NumPy 1.16.4 and Scikit-Learn 0.24.0, and everything runs smoothly without any error. I would propose either posting a new reproducible example or closing this issue.

odedbd commented Jan 24, 2021

@ftrojan - I ran into a similar issue with my own code, which uses CountVectorizer as part of a pipeline to classify text data. When I add CalibratedClassifierCV as part of a pipeline fit with GridSearchCV (not using cv='prefit'), my code gets an error due to data validation in CalibratedClassifierCV (self._validate_data). I verified that commenting out the data validation removes the error.

I cannot share my code (due to company policy, and also it's quite involved with other complications). I have modified the OP sample to use cross-validation rather than 'prefit'. If you run the code below, you will get the exception from _validate_data, as I do in my own code. I tried running this code with the _validate_data line commented out, but then I get a "TypeError: only integer scalar arrays can be converted to a scalar index" error, which I wasn't able to overcome.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.calibration import CalibratedClassifierCV, _CalibratedClassifier
from sklearn.pipeline import Pipeline

fake_features = [
    {'state': 'NY', 'age': 'adult'},
    {'state': 'TX', 'age': 'adult'},
    {'state': 'VT', 'age': 'child'},
    {'state': 'NY', 'age': 'adult'},
    {'state': 'TX', 'age': 'adult'},
    {'state': 'VT', 'age': 'child'},
]

labels = [1, 0, 1, 1, 0, 1]

pipeline = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('clf', RandomForestClassifier()),
])

# pipeline.fit(fake_features, labels)

clf_isotonic = CalibratedClassifierCV(base_estimator=pipeline, cv=2, method='isotonic')
clf_isotonic.fit(fake_features, labels)

odedbd commented Jan 26, 2021

Also, while the fit works, running prediction fails due to a similar check in predict_proba. Removing the check does not help for this use case, because the code of predict_proba assumes that X has a .shape attribute, which obviously fails for this list of dicts. For my own use case I was able to hack around this by overriding predict_proba, since my X input is a pandas DataFrame, which fortunately does have a .shape attribute.
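
For anyone needing a similar stop-gap, a rough sketch of that kind of override (a hypothetical subclass, not an official API; it simply skips the validation and reproduces the parent's averaging over the per-fold calibrated classifiers):

import numpy as np
from sklearn.calibration import CalibratedClassifierCV

class LenientCalibratedClassifierCV(CalibratedClassifierCV):
    # Workaround: bypass the array validation so non-array inputs
    # (DataFrames, lists of dicts) reach the wrapped pipeline as-is
    def predict_proba(self, X):
        # Average the probabilities over the per-fold calibrated
        # classifiers, as the parent does after validating X
        return np.mean(
            [c.predict_proba(X) for c in self.calibrated_classifiers_],
            axis=0)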

ogrisel (Member) commented Feb 25, 2021

@odedbd the prediction should be fixed in #19555 and your minimal reproducer passes on this branch. Let me know if that's enough for you.

ogrisel (Member) commented Mar 10, 2021

Fixed by #19641.

ogrisel closed this as completed Mar 10, 2021
odedbd commented Mar 11, 2021

@ogrisel Thanks, if the reproducer code works it should be fine. The merged changes are on the scikit-learn:main branch, right? Until now I have only worked off the released versions; should I be able to pip install directly from the GitHub repo in order to test this on my own code, or do I need to clone the repo and build? I apologize if this is not the right place to ask this.

ogrisel (Member) commented Apr 8, 2021

@odedbd sorry, I had not seen your reply. You can test that the fix solves your problem using the nightly build if you do not want to build from source with your own compilers:

https://scikit-learn.org/stable/developers/advanced_installation.html
