scikit-learn custom transformer is raising NotFitted Error #19953

amitmeel · 2021-04-22T12:21:01Z

Describe the bug

I was experimenting with scikit-learn after updating from 0.21.1 to 1.0.2 and found that my custom transformer had stopped working. I wonder what changed in version 1.0.2 to cause this. Is there a workaround?
Below is a code snippet that reproduces the issue:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.tree import DecisionTreeClassifier

class CustomVectorizer(BaseEstimator, TransformerMixin):
    """This class will perform the Vectorization."""
    def __init__(self, custom_content=None, custom_keyphrases=None):
        self.custom_content = custom_content
        self.custom_keyphrases = custom_keyphrases
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """Transform"""
        tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
        tfidf_custom_keyphrases = self.custom_keyphrases.transform(X.KeyPhrases.values.astype('U')).todense().astype(np.float32)

        tf_idf_X  = np.hstack((tfidf_custom_content, tfidf_custom_keyphrases))
        return tf_idf_X

class CustomSVD(BaseEstimator, TransformerMixin):
    """This class will perform the dimensionality reduction."""
    def __init__(self, tsvd=None, reduce_dim=True):
        self.tsvd = tsvd
        self.reduce_dim = reduce_dim
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """transform"""
        if self.reduce_dim:
            self.new_xtrain = self.tsvd.transform(X)
            return self.new_xtrain.astype(np.float32)
        else:
            return X.astype(np.float32)

example_dict = {"content": ["dfg sdfsd f rtygdfgdf sf sdf sdfdg df", "dfgdf sfdsd fs ertger sd g",
                            "dfgfdgdf fdgdf gfhfhrt", "fghgf c xzcvxwerkjwhx"],
                "KeyPhrases": ["sdfsd erfsd fsdf", " dfgdf ewrwe wef h dfh",
                                "fghfd wesdofjhcxlk sdf", "dfg dfg werwe"],
                "output":["pass", "fail", "pass", "fail"]}
df = pd.DataFrame(example_dict)
print(df)

X = df[["content", "KeyPhrases"]]
y = df[["output"]]

tf_content = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')
tf_keyphrase = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, stop_words='english')

tfidf_content = tf_content.fit_transform(X.content).toarray()
tfidf_keyphrases = tf_keyphrase.fit_transform(X.KeyPhrases.values.astype('U')).toarray()

X_temp = np.hstack((tfidf_content, tfidf_keyphrases))

# On top of X_temp I applied TruncatedSVD to pick n_components for some threshold
# of explained variance; here I'm hardcoding n_components and fitting it.
tsvd = TruncatedSVD(n_components = 20, random_state=42)
X_new = tsvd.fit_transform(X_temp)

# pipeline
# Note: tf_content, tf_keyphrase and tsvd are already fitted and I'm passing them to
# the custom vectorizer and custom SVD, hence I'm not doing anything in the fit method.
# `cls` is a classifier.
rf_pipline = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('rf_classifier', RandomForestClassifier())])

rf_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'rf_classifier': [RandomForestClassifier()],
    'rf_classifier__n_estimators': [10,20],
}

cls_pipeline = RandomizedSearchCV(rf_pipline, rf_search, n_iter=2, cv=2, verbose=1)

cls_pipeline.fit(X,y)
print(cls_pipeline.score(X,y))
print(cls_pipeline.predict_proba(X))

This was working in scikit-learn 0.21.1 as expected and giving the below output:

------------------------
>>>1.0
[[0.3 0.7]
 [0.6 0.4]
 [0.2 0.8]
 [0.8 0.2]]

but in scikit-learn 1.0.2, I'm getting the below error:

----------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "sklearn_custom_transformer_issue.py", line 106, in <module>
    cls_pipeline.fit(X,y)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\model_selection\_search.py", line 926, in fit
    self.best_estimator_.fit(X, y, **fit_params)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\base.py", line 855, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "sklearn_custom_transformer_issue.py", line 36, in transform
    tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 2099, in transform
    check_is_fitted(self, msg="The TF-IDF vectorizer is not fitted")
  File "C:\Users\user\miniconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1222, in check_is_fitted
    raise NotFittedError(msg % {"name": type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted

Also, when I define a second classifier and combine both with a voting classifier as shown below, I get a NotFittedError in scikit-learn 1.0.2, although the same code worked with scikit-learn 0.21.1:

# another pipeline
dc_pipline1 = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('dc_classifier', DecisionTreeClassifier())])

dc_search = {
    'vectorizer': [CustomVectorizer(tf_content, tf_keyphrase)],
    'reduce_dim': [CustomSVD(tsvd=tsvd)],
    'dc_classifier': [DecisionTreeClassifier()],
    'dc_classifier__max_depth': [4, 10],
}

cls_pipeline1 = RandomizedSearchCV(dc_pipline1, dc_search, n_iter=2, cv=2, verbose=1)

Vot_cls = VotingClassifier(estimators=[('rf', cls_pipeline),
                                       ('dt', cls_pipeline1)],
                           voting='soft')

Vot_cls.fit(X, y)

----------------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "temp.py", line 88, in <module>
    Vot_cls.fit(X, y)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 292, in fit
    return super().fit(X, transformed_y, sample_weight)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_voting.py", line 74, in fit
    self.estimators_ = Parallel(n_jobs=self.n_jobs)(
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\fixes.py", line 222, in __call__
    return self.function(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\ensemble\_base.py", line 39, in _fit_single_estimator     
    estimator.fit(X, y)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 341, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 303, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\joblib\memory.py", line 352, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\pipeline.py", line 754, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\base.py", line 702, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "temp.py", line 22, in transform
    tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\feature_extraction\text.py", line 1872, in transform      
    check_is_fitted(self, msg='The TF-IDF vectorizer is not fitted')
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\user\anaconda3\envs\ai\lib\site-packages\sklearn\utils\validation.py", line 1041, in check_is_fitted       
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted

Just wondering what changes when we call it through the voting classifier. Does it clone the estimator, so that the fitted TF-IDF instance is no longer passed to the custom vectorizer?

Versions

scikit-learn=0.24.1
numpy=1.19.2
scipy=1.6.0
pandas=1.2.1
platform: Windows_x64
Python=3.6.10

Note: Code was working fine in scikit-learn version: 0.21.1 , numpy: 1.18.1, scipy: 1.3.1 , pandas:0.25.1.

Reproducible code: code

glemaitre · 2021-04-22T17:50:10Z

Our Pipeline does not follow our own convention: #8157
We usually clone the parameters passed to the constructor. The only estimator that does not do that is Pipeline.
However, the VotingClassifier will first clone each pipeline along with its inner estimators. Therefore the TfidfVectorizer gets cloned, and the clone is equivalent to an unfitted estimator.

I assume that we should fix our Pipeline, but this is not straightforward since users rely on this behaviour. If we have a Pipeline that clones its steps, then we should probably look at how to freeze estimators (#8370) so that they don't get unfitted during cloning.
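
The cloning behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the reporter's code: `Wrapper` is a hypothetical stand-in for `CustomVectorizer`, and the three toy documents are made up. It only relies on `sklearn.base.clone`, which rebuilds an estimator from its constructor parameters and clones any parameter that is itself an estimator.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer


class Wrapper(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for CustomVectorizer: holds one pre-fitted vectorizer."""

    def __init__(self, vectorizer=None):
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        return self  # relies on the vectorizer having been fitted beforehand

    def transform(self, X):
        return self.vectorizer.transform(X)


docs = ["red apple pie", "green apple tart", "blue cheese pie"]  # toy data
fitted = TfidfVectorizer().fit(docs)

original = Wrapper(fitted)
cloned = clone(original)  # the step that cloning meta-estimators perform first

# The original still holds the fitted vectorizer; the clone holds a fresh one.
original_is_fitted = hasattr(original.vectorizer, "vocabulary_")
clone_is_fitted = hasattr(cloned.vectorizer, "vocabulary_")

try:
    cloned.fit(docs).transform(docs)
    clone_raised = False
except NotFittedError:
    clone_raised = True

print(original_is_fitted, clone_is_fitted, clone_raised)
```

Because the no-op `fit` never refits the inner vectorizer, the clone has nothing to transform with, which matches the traceback in the report.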

amitmeel · 2021-04-22T19:21:18Z

@glemaitre what I understood from your point is that Pipeline and RandomizedSearchCV make a shallow copy, and that this is why the parameters in the constructor cannot be cloned. But the question is why it was working with 0.21.1 (did we make this change after that release?).
Just thinking: can't we make a deep copy of the estimators so that we are able to clone the params too? (I think this would come at the cost of some extra memory.)

Is there any workaround for the current situation?

glemaitre · 2021-04-22T20:41:03Z

> @glemaitre what i got from your point is that basically pipeline and randomizedsearchcv is making shallow copy

Pipeline does not make any copy, while other estimators, including RandomizedSearchCV, clone (deep copy + deleting fitted attributes).
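
The difference can be checked with object identity. This is a small sketch under assumptions: the numeric toy data, `StandardScaler` and `LogisticRegression` are chosen only to keep it short and are not from the report, and `Pipeline` is used with its default `memory=None`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Pipeline.fit fits the very step objects it was given.
scaler_a = StandardScaler()
pipe = Pipeline([("scale", scaler_a), ("clf", LogisticRegression())])
pipe.fit(X, y)
pipeline_keeps_object = pipe.named_steps["scale"] is scaler_a
original_was_fitted = hasattr(scaler_a, "mean_")

# RandomizedSearchCV clones the pipeline before fitting, so the objects
# passed in stay untouched and best_estimator_ holds fresh copies.
scaler_b = StandardScaler()
search = RandomizedSearchCV(
    Pipeline([("scale", scaler_b), ("clf", LogisticRegression())]),
    {"clf__C": [0.1, 1.0]},
    n_iter=2,
    cv=2,
    random_state=0,
)
search.fit(X, y)
search_uses_copy = search.best_estimator_.named_steps["scale"] is not scaler_b
search_left_original_unfitted = not hasattr(scaler_b, "mean_")

print(pipeline_keeps_object, original_was_fitted)
print(search_uses_copy, search_left_original_unfitted)
```

A pre-fitted object placed directly in a `Pipeline` therefore keeps its state only as long as nothing around the pipeline clones it.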

> But question is why it was working with 0.21.1 (did we make this change after this release)

I would need to investigate more. I would not expect the behaviour to have changed.

amitmeel · 2021-04-27T14:51:09Z

@glemaitre is there any workaround for the above issue? Also, did you check version 0.21.1?

Please let me know if there is any specific file I can look into to understand why I am seeing this behavior.

glemaitre · 2021-12-17T19:31:28Z

I assume that what could have changed is the check_is_fitted implementation. Since your fit does not create any fitted attribute, check_is_fitted would not detect that the estimator is fitted. We now have a new API for this case:

def __sklearn_is_fitted__(self):
    return True

Adding this method bypasses the error: check_is_fitted will always see your estimator as fitted, which is appropriate if it is stateless. It is equivalent to our FunctionTransformer.

amitmeel · 2022-03-14T10:05:28Z

Even after adding __sklearn_is_fitted__ to the custom transformer, I'm still getting the same error.

amitmeel · 2022-03-14T10:09:21Z

@glemaitre can you please suggest a workaround? Even after adding it I'm getting the same error.

amitmeel · 2022-03-14T10:38:04Z

Put simply, how can we pass fitted transformers (in the above case tf_content and tf_keyphrase, which need to be passed to CustomVectorizer) to custom transformers so that, when we use them with RandomizedSearchCV or VotingClassifier, no NotFittedError is thrown and the transformations are performed as expected?
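
One possible sketch, offered as an assumption rather than a recommended scikit-learn pattern: `clone` only resets parameters that expose `get_params`; any other parameter value is deep-copied. Wrapping the fitted vectorizer in a plain holder object therefore lets its fitted state survive cloning. `Prefitted` and `PrefittedVectorizer` are hypothetical names, and the toy documents are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer


class Prefitted:
    """Hypothetical plain holder. It has no get_params, so clone() deep-copies it."""

    def __init__(self, estimator):
        self.estimator = estimator


class PrefittedVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that uses an already fitted vectorizer as-is."""

    def __init__(self, holder=None):
        self.holder = holder

    def fit(self, X, y=None):
        return self  # nothing to learn: the held vectorizer is already fitted

    def __sklearn_is_fitted__(self):
        return True

    def transform(self, X):
        return self.holder.estimator.transform(X).toarray().astype(np.float32)


docs = ["red apple pie", "green apple tart", "blue cheese pie"]  # toy data
tfidf = TfidfVectorizer().fit(docs)

wrapped = PrefittedVectorizer(Prefitted(tfidf))
cloned = clone(wrapped)

holder_was_copied = cloned.holder is not wrapped.holder
clone_still_fitted = hasattr(cloned.holder.estimator, "vocabulary_")
features = cloned.transform(docs)
print(holder_was_copied, clone_still_fitted, features.shape)
```

The trade-off is that the vectorizer was fitted once on all the data, so cross-validation scores obtained this way also reflect information from the validation folds.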

ogrisel · 2022-03-17T18:10:30Z

@amitmeel can you please try to make your reproducer as minimal as possible. I am under the impression that there are many unnecessary steps in your code. This makes it hard for us to understand what's causing you trouble.

https://scikit-learn.org/dev/developers/minimal_reproducer.html#minimal-reproducer

ogrisel · 2022-03-17T18:11:52Z

Something that I find weird in your code is that the CustomVectorizer meta-estimator does not call the fit method of its base estimators.

ogrisel · 2022-03-17T18:13:00Z

Same comment for CustomSVD.
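
For comparison, a minimal sketch of a meta-estimator that does fit its base estimators. This is an illustration under assumptions, not the reporter's code: `FittingVectorizer`, its parameter names and the toy DataFrame are hypothetical, while the column names `content` and `KeyPhrases` follow the report. Unfitted vectorizers are passed in, and `fit` clones and fits them, storing the results in attributes with a trailing underscore.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer


class FittingVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical variant of CustomVectorizer that fits its sub-estimators."""

    def __init__(self, content_vectorizer=None, keyphrase_vectorizer=None):
        self.content_vectorizer = content_vectorizer
        self.keyphrase_vectorizer = keyphrase_vectorizer

    def fit(self, X, y=None):
        # Fit clones so the objects passed to __init__ are never modified.
        self.content_vectorizer_ = clone(self.content_vectorizer).fit(X["content"])
        self.keyphrase_vectorizer_ = clone(self.keyphrase_vectorizer).fit(
            X["KeyPhrases"].astype(str)
        )
        return self

    def transform(self, X):
        content = self.content_vectorizer_.transform(X["content"]).toarray()
        keyphrases = self.keyphrase_vectorizer_.transform(
            X["KeyPhrases"].astype(str)
        ).toarray()
        return np.hstack((content, keyphrases)).astype(np.float32)


df = pd.DataFrame({  # toy data
    "content": ["red apple pie", "green apple tart", "blue cheese pie", "ripe green pear"],
    "KeyPhrases": ["apple pie", "apple tart", "cheese pie", "green pear"],
})

base = TfidfVectorizer()
vectorizer = clone(FittingVectorizer(base, TfidfVectorizer()))  # cloning is harmless here
features = vectorizer.fit_transform(df)

expected_columns = len(vectorizer.content_vectorizer_.vocabulary_) + len(
    vectorizer.keyphrase_vectorizer_.vocabulary_
)
base_untouched = not hasattr(base, "vocabulary_")
print(features.shape, base_untouched)
```

Written this way the transformer survives cloning by VotingClassifier or RandomizedSearchCV, at the cost of refitting the vectorizers on each training split instead of reusing ones fitted beforehand.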

amitmeel · 2022-03-19T06:31:24Z

@ogrisel if you closely look at the code we are passing a fitted transformer/vectorizer to a custom vectorizer.

rf_pipline = Pipeline([
    ('vectorizer', CustomVectorizer(tf_content, tf_keyphrase)),
    ('reduce_dim', CustomSVD(tsvd=tsvd)),
    ('classifier', RandomForestClassifier())])

In the above snippet, tf_content and tf_keyphrase are fitted instances of TfidfVectorizer, which is why I'm not doing anything in the CustomVectorizer fit method.

When I run the same code in version 0.21.1 and check whether tf_content is fitted using check_is_fitted, I can see it is a fitted instance of TfidfVectorizer, but in versions 0.24.1 and 1.0.2 I found, strangely, that it is now a non-fitted instance.

Reproducible code: code

Put simply, how can we pass fitted transformers (in the above case tf_content and tf_keyphrase, which need to be passed to CustomVectorizer) to custom transformers so that, when we use them with RandomizedSearchCV or VotingClassifier, no NotFittedError is thrown and the transformations are performed as expected?

amitmeel · 2022-03-31T08:52:47Z

@ogrisel @glemaitre any update on the above issue?

Are we planning to support scenarios where fitted transformers can be passed to meta-estimators (for example through an argument of VotingClassifier or RandomizedSearchCV indicating that they are already fitted)?
One possible solution I can think of for now is to save the fitted transformers and load them in CustomVectorizer and CustomSVD, as shown below:

import joblib

class CustomVectorizer(BaseEstimator, TransformerMixin):
    """This class will perform the Vectorization."""
    def __init__(self):
        self.custom_content = joblib.load('tf_content.pkl')
        self.custom_keyphrases = joblib.load('tf_keyphrase.pkl')
    
    def fit(self, X, y=None, *args, **kwargs):
        """fit"""
        return self

    def transform(self, X, y=None, **transform_params):
        """Transform"""
        tfidf_custom_content = self.custom_content.transform(X.content).todense().astype(np.float32)
        tfidf_custom_keyphrases = self.custom_keyphrases.transform(X.KeyPhrases.values.astype('U')).todense().astype(np.float32)

        tf_idf_X  = np.hstack((tfidf_custom_content, tfidf_custom_keyphrases))
        return tf_idf_X

I'm trying to dig deep into the code base, but I still can't understand how it was working with version 0.21.1.
Is there any other workaround for this?

adrinjalali · 2022-08-26T08:23:03Z

We still don't have a minimal reproducible example here.

You can use __sklearn_is_fitted__ to check whether the sub-estimator is fitted and return True. But your code above loads objects from a file in __init__, which it shouldn't; that should be done in fit. I'm closing this and will re-open once we have a minimal reproducible example.
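
A minimal sketch of what moving the loading into fit could look like, under assumptions: `LoadingVectorizer` and its `vectorizer_path` parameter are hypothetical, the toy documents are made up, and a temporary directory stands in for wherever the pickled vectorizer actually lives. The constructor stores only the path, which is a plain string and is unaffected by cloning.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils.validation import check_is_fitted


class LoadingVectorizer(TransformerMixin, BaseEstimator):
    """Hypothetical variant that stores only a path and loads the model in fit."""

    def __init__(self, vectorizer_path=None):
        self.vectorizer_path = vectorizer_path  # __init__ only stores parameters

    def fit(self, X, y=None):
        # Loading happens here; the trailing underscore marks a fitted attribute.
        self.vectorizer_ = joblib.load(self.vectorizer_path)
        return self

    def transform(self, X):
        check_is_fitted(self, "vectorizer_")
        return self.vectorizer_.transform(X).toarray().astype(np.float32)


docs = ["red apple pie", "green apple tart", "blue cheese pie"]  # toy data

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tf_content.pkl")
    joblib.dump(TfidfVectorizer().fit(docs), path)

    # Cloning keeps the path, and fit reloads the fitted vectorizer from it.
    features = clone(LoadingVectorizer(path)).fit(docs).transform(docs)

print(features.shape, features.dtype)
```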

amitmeel added the Bug: triage label Apr 22, 2021

glemaitre closed this as completed Dec 17, 2021

ogrisel reopened this Mar 17, 2022

adrinjalali closed this as completed Aug 26, 2022
