CalibratedClassifierCV doesn't interact properly with Pipeline estimators #8710
Hm, I get

but I'm on master. Either way, I agree that we don't need to do input validation here; it should be in the base estimator. |
I should have noted that in my actual use case, I get a different error:

ValueError: could not convert string to float:

which is caused by a column of strings in the DataFrame that I pass. In both cases, it is an error about the input data. |
I agree, we should not be validating X beyond the needs of safe_indexing or similar. PR fixing it welcome.
|
FYI, theoretically I would prefer … |
So this wasn't in my example above, but the pipeline gets put into GridSearchCV, and then I want to calibrate the model chosen by GridSearchCV. How does GridSearchCV interact with CalibratedClassifierCV? From my limited understanding, CalibratedClassifierCV produces K models, where each model is constructed from K-1 of the folds, and then averages the predictions from the K models to make a prediction. This seems semantically different from standard CV, where you select the model with the best performance over the K folds but then construct a single model using the entire training data. I'm not sure how to fit CalibratedClassifierCV into that procedure. The approach that made sense to me was to prefit the model from GridSearchCV and then calibrate it with a different set of data (as per the recommendation in the docs). |
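A minimal sketch of that prefit workflow, assuming a synthetic dataset and placeholder hyperparameters (none of the names below are from the original post):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
# hold out a separate set for calibration, as the docs recommend for cv='prefit'
X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'n_estimators': [10, 50]})
search.fit(X_train, y_train)

# calibrate the single refit best model on the held-out calibration set
calibrated = CalibratedClassifierCV(search.best_estimator_, cv='prefit')
calibrated.fit(X_calib, y_calib)
```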
Without prefit it's fine; with prefit it won't work right now :-/ You could just nest the GridSearchCV inside the cross-validation of the CalibratedClassifierCV, though that would be a bit computationally expensive. |
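A sketch of that nesting, again with placeholder data and parameters:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# each of the calibrator's folds re-runs the grid search on its own
# training portion, which is why this multiplies the computational cost
inner_search = GridSearchCV(LogisticRegression(), param_grid={'C': [0.1, 1, 10]})
calibrated = CalibratedClassifierCV(inner_search, cv=3)
calibrated.fit(X, y)
```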
I am also interested in knowing the status of the interplay between … |
CalibratedClassifierCV now handles the calibration process in such a way that probability estimates can be calibrated for multi-label targets. Several methods of CalibratedClassifierCV and _CalibratedClassifier were also cleaned up to support this new functionality. Changes include:

* Looser input validation on arguments passed to wrapped classifiers (fixes scikit-learn#8710)
* Target classes and type are determined before cross-validation, rather than on each fold individually
* Label predictions from CalibratedClassifierCV.predict are obtained using `LabelBinarizer.inverse_transform`, which supports multi-label predictions
* Specialized logic in _CalibratedClassifier for handling binary classification problems is more thoroughly commented
* Shape of uncalibrated estimates from the wrapped classifier is checked against the expected shape in _CalibratedClassifier
* Simplification of logic in _CalibratedClassifier.predict_proba, along with more comments explaining what's happening
* Tests for acceptance of 1D feature arrays as input and valid multi-label probability predictions
I'm currently running into the same issue. Is this clearly an input-validation issue, such that the encoding of categorical data won't be attempted before an error is thrown? Is there a way to disable this input validation locally? (I am also not familiar with …) |
Hm, I'm a bit surprised by the outcome of #13077, but I guess using dicts wasn't anticipated. I only skimmed the discussion there, but I feel like we should skip the input validation even more. |
I currently use … Is there anything I lose with this approach compared to …? Ideally, I would be able to calibrate my model in the Pipeline/GridSearchCV already, but I have trouble passing the parameters from the pipeline grid search to the calibrated model. Others seem to have similar problems: https://stackoverflow.com/a/49833102/1392529 |
I think the Stack Overflow answer you cite is mostly a misunderstanding of syntax. If you want us to properly understand your issue here, please provide code to explain what you're doing / trying. |
I think they are asking how to do this scenario.
How do I modify …? This is similar to how …
|
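A hedged sketch of what seems to be asked: when the pipeline is wrapped in CalibratedClassifierCV, grid-search parameter names need one extra prefix for the wrapped estimator, analogous to the step-name prefixes Pipeline itself requires. This assumes the thread-era API, where the wrapped estimator's parameter is named base_estimator (recent releases call it estimator):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
wrapped = CalibratedClassifierCV(pipe, cv=3)

# tuning C directly on the pipeline would use the key 'clf__C';
# wrapping adds one more level of nesting to the parameter name
param_grid = {'base_estimator__clf__C': [0.1, 1, 10]}
search = GridSearchCV(wrapped, param_grid, cv=3)
search.fit(X, y)
```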
For anyone else reading this, I got it to work with … It doesn't seem to work with … |
I guess it would go in this section https://scikit-learn.org/stable/modules/compose.html#nested-parameters |
The following

```python
# 'model' is the pipeline step name; the 'model__' prefix routes
# sample_weight through the search down to that step's fit method
fit_params = {
    'model__sample_weight': np.random.random(size=X.shape[0])
}
search = RandomizedSearchCV(my_pipeline, param_grid, cv=3, n_iter=4, iid=False)
search.fit(X, y, **fit_params)
```

works. |
Hm, actually I think it should go into the GridSearchCV documentation. You pointed towards the … |
I'm pretty sure this issue now contains (at least) three completely separate issues and we should probably separate them. |
@jsadloske I tried to add an example here based on your use case: #14548. The inconsistencies between … |
I tried the code example in the original post on my Mac with Python 3.7.3, NumPy 1.16.4 and scikit-learn 0.24.0, and everything runs smoothly without any error. I would propose to either post a new reproducible example or close this issue. |
@ftrojan - I ran into a similar issue with my own code, which uses CountVectorizer as part of a pipeline to classify text data. Adding the CalibratedClassifierCV as part of a pipeline fit with GridSearchCV (not using cv='prefit'), my code gets an error due to data validation in CalibratedClassifierCV (self._validate_data). I verified that commenting out the data validation removes the error. I cannot share my code (due to company policy; it's also quite involved, with other complications). I have modified the OP sample to use cross-validation rather than 'prefit'. If you run the code below, you will get the exception from validate_data, as I do in my own code. I tried running this code with the validate_data line commented out, but then I get a "TypeError: only integer scalar arrays can be converted to a scalar index" error which I wasn't able to overcome.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import Pipeline

# a list of dicts: valid input for DictVectorizer, but it fails
# CalibratedClassifierCV's own input validation before the pipeline sees it
fake_features = [
    {'state': 'NY', 'age': 'adult'},
    {'state': 'TX', 'age': 'adult'},
    {'state': 'VT', 'age': 'child'},
    {'state': 'NY', 'age': 'adult'},
    {'state': 'TX', 'age': 'adult'},
    {'state': 'VT', 'age': 'child'},
]
labels = [1, 0, 1, 1, 0, 1]

pipeline = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('clf', RandomForestClassifier()),
])

# pipeline.fit(fake_features, labels)  # fitting the pipeline alone works
clf_isotonic = CalibratedClassifierCV(base_estimator=pipeline, cv=2, method='isotonic')
clf_isotonic.fit(fake_features, labels)  # raises inside _validate_data
```
|
Also, while the fit works, running prediction fails, due to a similar check in predict_proba. Removing the check does not work for this use case, because the code of predict_proba assumes that X has a .shape attribute, which obviously fails for this list of dicts. For my own use case I was able to hack my way around this by overloading predict_proba, since my X input is a pandas DataFrame, which fortunately has the .shape attribute. |
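A minimal sketch of that kind of workaround, assuming the internals of scikit-learn 0.24 (the `calibrated_classifiers_` attribute holds the per-fold calibrated classifiers; the subclass name here is made up):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV

class LenientCalibratedClassifierCV(CalibratedClassifierCV):
    """Skip the wrapper's own input validation and average the per-fold
    calibrated probabilities directly; X only needs to support len(X)."""

    def predict_proba(self, X):
        mean_proba = np.zeros((len(X), len(self.classes_)))
        for calibrated in self.calibrated_classifiers_:
            # each per-fold classifier delegates to the wrapped pipeline,
            # which performs its own conversion of the raw input
            mean_proba += calibrated.predict_proba(X)
        mean_proba /= len(self.calibrated_classifiers_)
        return mean_proba
```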
Fixed by #19641. |
@ogrisel Thanks, if the reproducer code works it should be fine. The merged changes are on the scikit-learn:main branch, right? Until now I have only worked off the released versions; should I be able to pip install directly from the GitHub repo in order to test this on my own code, or do I need to clone the repo and build? I apologize if this is not the right place to ask this. |
@odedbd sorry, I had not seen your reply. You can test that the fix solves your problem using the nightly build if you do not want to build from source with your own compilers: https://scikit-learn.org/stable/developers/advanced_installation.html |
Hi,
I'm trying to use CalibratedClassifierCV to calibrate the probabilities from a gradient boosted tree model. The GBM is wrapped in a Pipeline estimator, where the initial stages of the Pipeline convert categoricals (using DictVectorizer) prior to the GBM being fit. The issue is that when I similarly try to use CalibratedClassifierCV, with a prefit estimator, it fails when I pass in the data. Here's a small example:
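The original snippet did not survive this copy; what follows is a hedged reconstruction based on the modified sample quoted earlier on this page, switched back to cv='prefit' and a gradient boosted model as described (thread-era base_estimator API):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

# a list of dicts is valid input for DictVectorizer
fake_features = [
    {'state': 'NY', 'age': 'adult'},
    {'state': 'TX', 'age': 'adult'},
    {'state': 'VT', 'age': 'child'},
    {'state': 'NY', 'age': 'adult'},
    {'state': 'TX', 'age': 'adult'},
    {'state': 'VT', 'age': 'child'},
]
labels = [1, 0, 1, 1, 0, 1]

pipeline = Pipeline([
    ('vectorizer', DictVectorizer()),
    ('clf', GradientBoostingClassifier()),
])
pipeline.fit(fake_features, labels)  # fitting the pipeline alone works

clf_isotonic = CalibratedClassifierCV(base_estimator=pipeline, cv='prefit',
                                      method='isotonic')
clf_isotonic.fit(fake_features, labels)  # fails on input validation
```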
When running that, I get the following error on the last line:
On the other hand, if I replace the last two lines with the following, things work fine:
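That snippet is also missing; given the reference to _CalibratedClassifier below, it was presumably along these lines (an assumption — _CalibratedClassifier is a private helper whose signature has since changed), continuing from the reconstruction above:

```python
from sklearn.calibration import _CalibratedClassifier

# the private helper fits and calibrates directly, without the upfront
# check_array validation that CalibratedClassifierCV.fit performs
clf_isotonic = _CalibratedClassifier(base_estimator=pipeline, method='isotonic')
clf_isotonic.fit(fake_features, labels)
```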
It seems that CalibratedClassifierCV checks whether the X data is valid before invoking anything on the base estimator (https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/calibration.py#L126). In my case, this logic seems slightly off, since I'm using the pipeline to convert the data into the proper form before feeding it into the estimator.
On the other hand, _CalibratedClassifier doesn't make this check first, so the code works (i.e. the data is fed into the pipeline, the model is fit, and then probabilities are calibrated appropriately).
My use case (which is not reflected in the example) is to use the initial stages of the pipeline to select columns from a dataframe, encode the categoricals, and then fit the model. I then pickle the fitted pipeline (after using GridSearchCV to select hyperparameters). Later on, I can load the model and use it to predict on new data, while abstracting away the details of how the raw data needs to be transformed. I now want to calibrate the model after fitting it, but I ran into this problem.
For reference, here's all my system info:
Thanks for reading (and for all of your hard work on scikit-learn!).