BUG: sklearn.feature_extraction.text.HashingVectorizer.fit_transform raises ValueError when it shouldn't #8941

Closed
@BenKaehler

Description

sklearn.feature_extraction.text.HashingVectorizer.fit_transform raises "ValueError: indices and data should have the same size" once the input reaches a certain size. If you split the same data into chunks and transform each chunk separately, it runs fine.

Steps/Code to Reproduce

import sklearn
from sklearn.feature_extraction.text import HashingVectorizer
print('scikit-learn version')
print(sklearn.__version__)
vectorizer = HashingVectorizer(
    analyzer='char', non_negative=True,
    n_features=1024, ngram_range=[4,16])
X = ['A'*1432]*203452
print('works')
vectorizer.fit_transform(X[:100000])
print('does not work')
vectorizer.fit_transform(X)
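
The two input sizes above can be compared by counting how many character n-grams each call would have to hash. The sketch below is only arithmetic: it assumes the 'char' analyzer emits every n-gram for each n in ngram_range, and the helper name char_ngram_count is hypothetical. The report does not identify a cause; the comparison against a signed 32-bit limit is a guess about what might be worth checking, not a diagnosis.

```python
def char_ngram_count(doc_length, min_n, max_n):
    """Count character n-grams with min_n <= n <= max_n in one document."""
    return sum(doc_length - n + 1
               for n in range(min_n, min(max_n, doc_length) + 1))


INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can hold

per_doc = char_ngram_count(1432, 4, 16)  # n-grams in one 'A'*1432 document
working = per_doc * 100000               # total for X[:100000]
failing = per_doc * 203452               # total for the full X

print(per_doc)               # 18499
print(working <= INT32_MAX)  # True
print(failing <= INT32_MAX)  # False
```

Under these assumptions the working call stays below 2**31 - 1 total n-grams and the failing call exceeds it, which is at least consistent with the size-dependent behaviour described.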

Expected Results

scikit-learn version
0.18.1
works
does not work

Actual Results

scikit-learn version
0.18.1
works
does not work
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-aae200adab09> in <module>()
     10 vectorizer.fit_transform(X[:100000])
     11 print('does not work')
---> 12 vectorizer.fit_transform(X)

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/sklearn/feature_extraction/text.py in transform(self, X, y)
    485 
    486         analyzer = self.build_analyzer()
--> 487         X = self._get_hasher().transform(analyzer(doc) for doc in X)
    488         if self.binary:
    489             X.data.fill(1)

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/sklearn/feature_extraction/hashing.py in transform(self, raw_X, y)
    147 
    148         X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
--> 149                           shape=(n_samples, self.n_features))
    150         X.sum_duplicates()  # also sorts the indices
    151         if self.non_negative:

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     96             self.data = np.asarray(self.data, dtype=dtype)
     97 
---> 98         self.check_format(full_check=False)
     99 
    100     def getnnz(self, axis=None):

/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/scipy/sparse/compressed.py in check_format(self, full_check)
    165         # check index and data arrays
    166         if (len(self.indices) != len(self.data)):
--> 167             raise ValueError("indices and data should have the same size")
    168         if (self.indptr[-1] > len(self.indices)):
    169             raise ValueError("Last value of index pointer should be less than "

ValueError: indices and data should have the same size
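
Since chunking the data is reported to run fine, one possible workaround is to transform the documents chunk by chunk and stack the results. This is a minimal sketch, not a fix from the scikit-learn project: the helper transform_in_chunks and the chunk size are hypothetical, and whether the stacked result avoids the error at the full problem size has not been verified here. It relies on HashingVectorizer being stateless, so each chunk can be transformed independently. The non_negative argument is left out so the sketch does not depend on a particular scikit-learn version.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer


def transform_in_chunks(vectorizer, docs, chunk_size):
    """Transform docs chunk by chunk and stack the sparse results row-wise."""
    parts = [
        vectorizer.transform(docs[start:start + chunk_size])
        for start in range(0, len(docs), chunk_size)
    ]
    return sp.vstack(parts, format='csr')


vectorizer = HashingVectorizer(
    analyzer='char', n_features=1024, ngram_range=(4, 16))

# Small stand-in data; the report uses ['A'*1432]*203452.
docs = ['A' * 50, 'AB' * 30, 'ABC' * 20, 'A' * 40, 'B' * 25]
X_chunked = transform_in_chunks(vectorizer, docs, chunk_size=2)
X_whole = vectorizer.transform(docs)
```

On the small stand-in data, X_chunked and X_whole should hold the same rows, because each row is hashed and normalized independently of the others.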

Versions

Darwin-16.5.0-x86_64-i386-64bit
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 12:15:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.18.1

and

Linux-2.6.32-504.16.2.el6.x86_64-x86_64-with-centos-6.6-Final
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar  6 2017, 11:58:13) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.18.1
