
scikit-learn custom transformer is raising NotFitted Error #19953


Closed
amitmeel opened this issue Apr 22, 2021 · 14 comments

@amitmeel

amitmeel commented Apr 22, 2021

Describe the bug

I was experimenting with scikit-learn after updating scikit-learn from 0.21.1 to 1.0.2, and found that the custom transformer had stopped working. I wonder what might have changed in version 1.0.2 which caused this issue. Is there a workaround to resolve this issue?
The code below reproduces the issue:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.tree import DecisionTreeClassifier
class CustomVectorizer(BaseEstimator, TransformerMixin):
    """This class will perform the Vectorization."""
    def __init__(self, custom_content=None, custom_keyphrases=None):
        self.custom_content = custom_content
        self.custom_keyphrases = custom_keyphrases
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """Transform"""
        tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
        tfidf_custom_keyphrases = self.custom_keyphrases.transform(X.KeyPhrases.values.astype('U')).todense().astype(np.float32)

        tf_idf_X  = np.hstack((tfidf_custom_content, tfidf_custom_keyphrases))
        return tf_idf_X
class CustomSVD(BaseEstimator, TransformerMixin):
    """This class will perform the dimensionality reduction."""
    def __init__(self, tsvd=None, reduce_dim=True):
        self.tsvd = tsvd
        self.reduce_dim = reduce_dim
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """transform"""
        if self.reduce_dim:
            self.new_xtrain = self.tsvd.transform(X)
            return self.new_xtrain.astype(np.float32)
        else:
            return X.astype(np.float32)
example_dict = {"content": ["dfg sdfsd f rtygdfgdf sf sdf sdfdg df", "dfgdf sfdsd fs ertger sd g",
                            "dfgfdgdf fdgdf gfhfhrt", "fghgf c xzcvxwerkjwhx"],
                "KeyPhrases": ["sdfsd erfsd fsdf", " dfgdf ewrwe wef h dfh",
                                "fghfd wesdofjhcxlk sdf", "dfg dfg werwe"],
                "output":["pass", "fail", "pass", "fail"]}
df = pd.DataFrame(example_dict)
print(df)

X = df[["content", "KeyPhrases"]]
y = df[["output"]]

tf_content = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')
tf_keyphrase = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')

tfidf_content = tf_content.fit_transform(X.content).toarray()
tfidf_keyphrases = tf_keyphrase.fit_transform(X.KeyPhrases.values.astype('U')).toarray()

X_temp = np.hstack((tfidf_content, tfidf_keyphrases))

# On top of X_temp I applied TruncatedSVD to get n_components for some threshold
# using explained variance; here I'm hardcoding n_components and fitting it.
tsvd = TruncatedSVD(n_components = 20, random_state=42)
X_new = tsvd.fit_transform(X_temp)

# pipeline
# Note: tf_content, tf_keyphrase and tsvd are already fitted and I'm passing them to
# the custom vectorizer and custom SVD, hence I'm not doing anything in the fit method.
# `cls` is a classifier.
rf_pipline = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('rf_classifier', RandomForestClassifier())])

rf_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'rf_classifier': [RandomForestClassifier()],
    'rf_classifier__n_estimators': [10,20],
}

cls_pipeline = RandomizedSearchCV(rf_pipline, rf_search, n_iter=2, cv=2, verbose=1)

cls_pipeline.fit(X,y)
print(cls_pipeline.score(X,y))
print(cls_pipeline.predict_proba(X))

This was working in scikit-learn 0.21.1 as expected and giving the below output:

------------------------
>>>1.0
[[0.3 0.7]
 [0.6 0.4]
 [0.2 0.8]
 [0.8 0.2]]

but in scikit-learn 1.0.2, I'm getting the below error:

----------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "sklearn_custom_transformer_issue.py", line 106, in <module>
    cls_pipeline.fit(X,y)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\model_selection\_search.py", line 926, in fit
    self.best_estimator_.fit(X, y, **fit_params)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\base.py", line 855, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "sklearn_custom_transformer_issue.py", line 36, in transform
    tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 2099, in transform
    check_is_fitted(self, msg="The TF-IDF vectorizer is not fitted")
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1222, in check_is_fitted
    raise NotFittedError(msg % {"name": type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted

Also, when I defined a new classifier and used the voting classifier as shown below, I'm getting NotFitted Error in scikit-learn 1.0.2 but the same code was working with scikit-learn 0.21.1 :

# another pipeline
dc_pipline1 = Pipeline([
('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
('reduce_dim', CustomSVD(tsvd=tsvd)),
('dc_classifier', DecisionTreeClassifier())])

dc_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'dc_classifier': [DecisionTreeClassifier()],
    'dc_classifier__max_depth': [4, 10],
}

cls_pipeline1 = RandomizedSearchCV(dc_pipline1, dc_search, n_iter=2, cv=2, verbose=1)

Vot_cls = VotingClassifier(estimators=[('rf', cls_pipline), 
                                        ('dt', cls_pipline1)], 
                                       voting='soft')

Vot_cls.fit(X, y)
----------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "temp.py", line 88, in <module>
    Vot_cls.fit(X, y)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 292, in fit
    return super().fit(X, transformed_y, sample_weight)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 74, in fit
    self.estimators_ = Parallel(n_jobs=self.n_jobs)(
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_base.py", line 39, in _fit_single_estimator     
    estimator.fit(X, y)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "temp.py", line 22, in transform
    tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 1872, in transform      
    check_is_fitted(self, msg='The TF-IDF vectorizer is not fitted')
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1041, in check_is_fitted       
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted

I'm wondering what changed when the pipelines are called through the voting classifier. Does it clone the estimators, so that the fitted tfidf instance is no longer passed to the custom vectorizer?

Versions

scikit-learn=0.24.1
numpy=1.19.2
scipy=1.6.0
pandas=1.2.1
platform: Windows_x64
Python=3.6.10

Note: Code was working fine in scikit-learn version: 0.21.1 , numpy: 1.18.1, scipy: 1.3.1 , pandas:0.25.1.

Reproducible code: code

@glemaitre
Member

Our Pipeline does not follow our own convention: #8157
We usually clone the estimators passed as constructor parameters; the only estimator that does not do so is Pipeline.
However, VotingClassifier will first clone each pipeline, including its inner estimators. The TfidfVectorizer therefore gets cloned, and the clone is equivalent to an unfitted estimator.

I assume that we should fix our Pipeline, but this is not straightforward since users rely on this behaviour. If we have a Pipeline that clones its steps, then we should probably look at how to freeze estimators (#8370) so that they don't become unfitted during cloning.
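
The cloning behaviour described in this comment can be seen in isolation. The following is only a minimal sketch: `HoldsVectorizer` and `can_transform` are hypothetical names introduced for illustration, and the wrapper is a reduced stand-in for the CustomVectorizer from the report, not the reporter's actual class. It assumes that `sklearn.base.clone` is what meta-estimators apply to their sub-estimators, as stated above.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer


class HoldsVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for CustomVectorizer: fit() does nothing."""

    def __init__(self, vectorizer=None):
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return self.vectorizer.transform(X)


def can_transform(estimator, docs):
    """Return True if transform() runs, False if it raises NotFittedError."""
    try:
        estimator.transform(docs)
        return True
    except NotFittedError:
        return False


docs = ["red apple", "green pear", "red pear"]
wrapper = HoldsVectorizer(TfidfVectorizer().fit(docs))

# clone() rebuilds the wrapper from its constructor parameters and clones
# any parameter that is itself an estimator, dropping its fitted state.
cloned = clone(wrapper)

original_works = can_transform(wrapper, docs)
clone_works = can_transform(cloned, docs)
print(original_works, clone_works)
```

Under these assumptions the original wrapper transforms fine while its clone fails, because the inner TfidfVectorizer held by the clone is a fresh, unfitted copy.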

@amitmeel
Author

amitmeel commented Apr 22, 2021

@glemaitre what i got from your point is that basically pipeline and randomizedsearchcv is making shallow copy and that leads to not able to clone the parameters in the constructor. But question is why it was working with 0.21.1 (did we make this change after this release)
Just thinking: couldn't we make a deep copy of the estimators so that the params are cloned too? (I think this would come at the cost of some extra memory.)

Is there any workaround for the current situation?

@glemaitre
Member

@glemaitre what i got from your point is that basically pipeline and randomizedsearchcv is making shallow copy

Pipeline does not make any copy, while other estimators, including RandomizedSearchCV, clone (deep copy + deleting fitted attributes).

But question is why it was working with 0.21.1 (did we make this change after this release)

I would need to investigate more. I would not expect the behaviour to have changed.
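
The distinction drawn in this reply, Pipeline keeping the very objects it is given while cloning produces fresh unfitted copies, can be sketched as follows. This is an illustrative example only: the scaler, classifier and toy data are assumptions chosen to keep it short, and it relies on the default `memory=None` setting of Pipeline.

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
pipe = Pipeline([("scale", scaler), ("clf", LogisticRegression())])

# Pipeline stores the step object itself, not a copy of it.
same_object = pipe.named_steps["scale"] is scaler

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
pipe.fit(X, y)

# With the default memory=None, the step is fitted in place.
fitted_in_place = hasattr(scaler, "mean_")

# clone() builds new estimators from the parameters; fitted state is dropped.
cloned_scaler = clone(pipe).named_steps["scale"]
clone_is_new = cloned_scaler is not scaler
clone_is_unfitted = not hasattr(cloned_scaler, "mean_")

print(same_object, fitted_in_place, clone_is_new, clone_is_unfitted)
```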

@amitmeel
Author

amitmeel commented Apr 27, 2021

@glemaitre is there any workaround for the above issue? Also, did you check version 0.21.1?

Please let me know if there is any specific file I can look into to understand why I am facing this behavior.

@glemaitre
Member

I assume that the thing that could have changed is the check_is_fitted implementation. Since your fit does not create any fitted attribute, check_is_fitted does not detect that the estimator is fitted. We now have a new API for this case:

def __sklearn_is_fitted__(self):
    return True

Adding this method bypasses the error: check_is_fitted will always see your estimator as fitted, which is appropriate if it is stateless. It is equivalent to our FunctionTransformer.

@amitmeel
Author

Even after adding __sklearn_is_fitted__ to the custom transformer, I'm still getting the same error.

@amitmeel
Author

@glemaitre can you please suggest a workaround, since even after adding __sklearn_is_fitted__ I'm getting the same error?

@amitmeel
Author

Put simply, how can we pass fitted transformers (here tf_content and tf_keyphrase, which are passed to CustomVectorizer) into custom transformers so that using them with RandomizedSearchCV or VotingClassifier does not throw NotFittedError and the transformations behave as expected?

ogrisel reopened this Mar 17, 2022
@ogrisel
Member

ogrisel commented Mar 17, 2022

@amitmeel can you please try to make your reproducer as minimal as possible. I am under the impression that there are many unnecessary steps in your code. This makes it hard for us to understand what's causing you trouble.

https://scikit-learn.org/dev/developers/minimal_reproducer.html#minimal-reproducer

@ogrisel
Member

ogrisel commented Mar 17, 2022

Something that I find weird in your code is that the CustomVectorizer meta-estimator does not call the fit method of its base estimators.

@ogrisel
Member

ogrisel commented Mar 17, 2022

Same comment for CustomSVD.
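
One way to read these two comments is that a meta-estimator would normally fit its sub-estimators inside its own fit method, so that cloning by RandomizedSearchCV or VotingClassifier is harmless. The sketch below illustrates that pattern under stated assumptions: `FittingVectorizer` is a hypothetical variant of CustomVectorizer, the toy DataFrame is invented for the example, and the column names are taken from the report.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.validation import check_is_fitted


class FittingVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical variant of CustomVectorizer that fits its sub-estimators."""

    def __init__(self, custom_content=None, custom_keyphrases=None):
        self.custom_content = custom_content
        self.custom_keyphrases = custom_keyphrases

    def fit(self, X, y=None):
        # Fit clones so the constructor arguments themselves stay untouched.
        self.content_vectorizer_ = clone(self.custom_content).fit(X["content"])
        self.keyphrase_vectorizer_ = clone(self.custom_keyphrases).fit(
            X["KeyPhrases"].astype(str)
        )
        return self

    def transform(self, X):
        check_is_fitted(self)
        content = self.content_vectorizer_.transform(X["content"])
        keyphrases = self.keyphrase_vectorizer_.transform(
            X["KeyPhrases"].astype(str)
        )
        stacked = np.hstack([content.toarray(), keyphrases.toarray()])
        return stacked.astype(np.float32)


df = pd.DataFrame(
    {
        "content": ["red apple pie", "green pear tart", "red pear jam"],
        "KeyPhrases": ["apple pie", "pear tart", "pear jam"],
    }
)
vectorizer = FittingVectorizer(TfidfVectorizer(), TfidfVectorizer())

# Cloning first mimics what a meta-estimator does before calling fit().
features = clone(vectorizer).fit(df).transform(df)
print(features.shape, features.dtype)
```

Because the fitted state is created in fit and stored in attributes ending with an underscore, a clone can always be refitted, at the cost of refitting the vectorizers on each training split.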

@amitmeel
Author

amitmeel commented Mar 19, 2022

@ogrisel if you closely look at the code we are passing a fitted transformer/vectorizer to a custom vectorizer.

rf_pipline = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('classifier', RandomForestClassifier())])

In the above snippet, tf_content and tf_keyphrase are fitted instances of TfidfVectorizer; that's why I'm not doing anything in the CustomVectorizer fit method.

When I run the same code in version 0.21.1 and check whether tf_content is fitted using check_is_fitted, I can see it is a fitted instance of tfidf, but in versions 0.24.1 and 1.0.2 I found, oddly, that it is now a non-fitted instance of tfidf.

Reproducible code: code

Put simply, how can we pass fitted transformers (here tf_content and tf_keyphrase, which are passed to CustomVectorizer) into custom transformers so that using them with RandomizedSearchCV or VotingClassifier does not throw NotFittedError and the transformations behave as expected?

@amitmeel
Author

@ogrisel @glemaitre any update on the above issue?

  1. Are we planning to support the scenario where fitted transformers are passed to meta-estimators (for example, via some argument in VotingClassifier or RandomizedSearchCV indicating that they are already fitted)?

  2. One possible solution I can think of for now is to save the fitted transformers and load them in CustomVectorizer and CustomSVD as shown below:

import joblib

class CustomVectorizer(BaseEstimator, TransformerMixin):
    """This class will perform the Vectorization."""
    def __init__(self):
        self.custom_content = joblib.load('tf_content.pkl')
        self.custom_keyphrases = joblib.load('tf_keyphrase.pkl')
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """Transform"""
        tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
        tfidf_custom_keyphrases = self.custom_keyphrases.transform(X.KeyPhrases.values.astype('U')).todense().astype(np.float32)

        tf_idf_X  = np.hstack((tfidf_custom_content, tfidf_custom_keyphrases))
        return tf_idf_X
  3. I'm trying to dig deep into the code base, but I'm still not able to understand how it was working with version 0.21.1.
  4. Is there any other workaround for this?

@adrinjalali
Member

We still don't have a minimal reproducible example here.

You can use __sklearn_is_fitted__ to check whether the sub-estimator is fitted and return True. But your code above loads objects from a file in __init__, which it shouldn't; that should be done in fit. I'm closing this and will re-open once we have a minimal reproducible example.
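
A minimal sketch of the suggestion in this comment, under stated assumptions: the constructor keeps only a file path, loading happens in fit, and `__sklearn_is_fitted__` reports whether the loaded sub-estimator is present. `PretrainedVectorizer` and its `path` parameter are hypothetical names, and the temporary file stands in for the reporter's tf_content.pkl.

```python
import os
import tempfile

import joblib
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer


class PretrainedVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that loads a pre-fitted vectorizer in fit()."""

    def __init__(self, path=None):
        # A plain string parameter survives clone() unchanged.
        self.path = path

    def fit(self, X, y=None):
        # Loading happens here rather than in __init__.
        self.vectorizer_ = joblib.load(self.path)
        return self

    def transform(self, X):
        return self.vectorizer_.transform(X)

    def __sklearn_is_fitted__(self):
        return hasattr(self, "vectorizer_")


docs = ["red apple", "green pear", "red pear"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tf_content.pkl")
    joblib.dump(TfidfVectorizer().fit(docs), path)

    template = PretrainedVectorizer(path=path)
    # Cloning mimics RandomizedSearchCV or VotingClassifier before fit().
    model = clone(template).fit(docs)
    n_features = model.transform(docs).shape[1]

print(n_features, template.__sklearn_is_fitted__(), model.__sklearn_is_fitted__())
```

The design choice here is that everything passed to the constructor is cheap to copy, so cloning never discards fitted state; the pre-fitted vectorizer is reloaded each time fit is called.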
