Closed
Description
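sklearn.feature_extraction.text.HashingVectorizer.fit_transform
raises ValueError: indices and data should have the same size
once the input is large enough. In the example below, 203452 documents of 1432 characters each fail, while the first 100000 of them succeed. Processing the same data in chunks runs fine.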
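The chunking workaround mentioned above could look roughly like the sketch below. The helper names iter_chunks and transform_in_chunks and the default chunk size of 50000 are made up for illustration and are not part of scikit-learn; the only chunk size this report shows working is 100000. It relies on HashingVectorizer being stateless, so each chunk can be transformed independently and the results stacked row-wise.

```python
def iter_chunks(docs, chunk_size):
    """Yield consecutive slices of docs, each holding at most chunk_size items."""
    for start in range(0, len(docs), chunk_size):
        yield docs[start:start + chunk_size]


def transform_in_chunks(vectorizer, docs, chunk_size=50000):
    """Vectorize docs chunk by chunk and stack the sparse results row-wise.

    chunk_size=50000 is an arbitrary illustrative default, not a tested limit.
    docs must be non-empty, because stacking zero matrices is an error.
    """
    # Imported lazily so that iter_chunks stays usable without SciPy installed.
    import scipy.sparse as sp

    parts = [vectorizer.transform(chunk) for chunk in iter_chunks(docs, chunk_size)]
    return sp.vstack(parts, format='csr')
```

With the vectorizer and X from the reproduction below, the call would be transform_in_chunks(vectorizer, X), which should return a matrix with 203452 rows and 1024 columns.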
Steps/Code to Reproduce
import sklearn
from sklearn.feature_extraction.text import HashingVectorizer
print('scikit-learn version')
print(sklearn.__version__)
vectorizer = HashingVectorizer(
    analyzer='char', non_negative=True,
    n_features=1024, ngram_range=[4,16])
X = ['A'*1432]*203452
print('works')
vectorizer.fit_transform(X[:100000])
print('does not work')
vectorizer.fit_transform(X)
Expected Results
scikit-learn version
0.18.1
works
does not work
Actual Results
scikit-learn version
0.18.1
works
does not work
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-aae200adab09> in <module>()
10 vectorizer.fit_transform(X[:100000])
11 print('does not work')
---> 12 vectorizer.fit_transform(X)
/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/sklearn/feature_extraction/text.py in transform(self, X, y)
485
486 analyzer = self.build_analyzer()
--> 487 X = self._get_hasher().transform(analyzer(doc) for doc in X)
488 if self.binary:
489 X.data.fill(1)
/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/sklearn/feature_extraction/hashing.py in transform(self, raw_X, y)
147
148 X = sp.csr_matrix((values, indices, indptr), dtype=self.dtype,
--> 149 shape=(n_samples, self.n_features))
150 X.sum_duplicates() # also sorts the indices
151 if self.non_negative:
/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
96 self.data = np.asarray(self.data, dtype=dtype)
97
---> 98 self.check_format(full_check=False)
99
100 def getnnz(self, axis=None):
/Users/benkaehler/miniconda3/envs/qiime2-2017.4/lib/python3.5/site-packages/scipy/sparse/compressed.py in check_format(self, full_check)
165 # check index and data arrays
166 if (len(self.indices) != len(self.data)):
--> 167 raise ValueError("indices and data should have the same size")
168 if (self.indptr[-1] > len(self.indices)):
169 raise ValueError("Last value of index pointer should be less than "
ValueError: indices and data should have the same size
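One possible explanation, which is a guess and not something this report confirms, is that the total number of hashed tokens no longer fits in a signed 32-bit integer. The arithmetic below counts the character n-grams produced for the documents in the reproduction and compares the totals with 2**31 - 1. The helper name count_char_ngrams is made up for illustration, and the count assumes one token per n-gram position for each n from 4 to 16.

```python
def count_char_ngrams(doc_length, min_n, max_n):
    """Count character n-grams of sizes min_n..max_n in one document."""
    return sum(max(doc_length - n + 1, 0) for n in range(min_n, max_n + 1))


INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer

# Values taken from the reproduction: 1432 characters, ngram_range 4 to 16.
per_doc = count_char_ngrams(1432, 4, 16)

total_works = per_doc * 100000   # the slice that succeeds
total_fails = per_doc * 203452   # the full input that raises ValueError

print(per_doc, total_works, total_fails)
print(total_works <= INT32_MAX, total_fails <= INT32_MAX)
```

Under these assumptions the succeeding slice stays below the 32-bit limit and the failing input exceeds it, which is at least consistent with the observed behaviour.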
Versions
Darwin-16.5.0-x86_64-i386-64bit
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 12:15:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.18.1
and
Linux-2.6.32-504.16.2.el6.x86_64-x86_64-with-centos-6.6-Final
Python 3.5.3 |Continuum Analytics, Inc.| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.12.1
SciPy 0.19.0
Scikit-Learn 0.18.1