Description
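Calling selector.fit_transform() to select significant features from a large sparse dictionary vector (millions to billions of samples, hundreds of thousands of features) fails with "OverflowError: signed integer is greater than maximum". The traceback shows that the error is raised inside DictVectorizer.transform(), which is evaluated as the argument to fit_transform() before feature selection starts.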
Steps/Code to Reproduce
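import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import GenericUnivariateSelect, chi2
protein_vec = DictVectorizer(sparse=True, dtype=np.uint16).fit(protein_in_pairs)
selector = GenericUnivariateSelect(chi2, 'fpr', param=UserInput.fpr_alpha)
protein_vec_selected = selector.fit_transform(protein_vec.transform(protein_in_pairs), labels_balanced)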
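Because the failure is an integer overflow on `indptr.append(len(indices))`, it may help to check how many (feature, value) pairs the input holds before transforming it. The sketch below is only a rough diagnostic: `count_stored_entries` and `toy_samples` are hypothetical names, not part of scikit-learn or the original script, and comparing against the signed 32-bit limit assumes that this is the limit being hit.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def count_stored_entries(samples):
    """Return the total number of (feature, value) pairs across all samples."""
    return sum(len(sample) for sample in samples)


# Tiny stand-in for protein_in_pairs (hypothetical data, not from the report).
toy_samples = [{"AA": 1, "AC": 2}, {"AC": 1}, {"GG": 3, "GT": 1, "TT": 2}]

total = count_stored_entries(toy_samples)
# If the second value is True for the real data, the index count would not
# fit in a signed 32-bit integer.
print(total, total > INT32_MAX)
```

On the real data the same call would be `count_stored_entries(protein_in_pairs)`. The count is an upper bound on the number of stored indices, since features missing from the fitted vocabulary are presumably skipped during transform().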
Expected Results
No error is thrown, and the significant features are returned.
Actual Results
  File "/home/xx/.conda/envs/seqfeaturizer/lib/python3.6/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 292, in transform
    return self._transform(X, fitting=False)
  File "/home/xx/.conda/envs/seqfeaturizer/lib/python3.6/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 181, in _transform
    indptr.append(len(indices))
OverflowError: signed integer is greater than maximum
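One plausible reading of this traceback, stated here as an assumption rather than something the report confirms, is that `indptr` is a typed `array.array` of signed C ints, so appending a `len(indices)` value above the C int maximum raises the error. The sketch below reproduces that failure mode with the standard library alone; `fits_in_c_int` is a hypothetical helper name.

```python
from array import array


def fits_in_c_int(value):
    """Return True if `value` can be appended to an array of signed C ints."""
    buf = array("i")  # typecode "i" is a signed C int
    try:
        buf.append(value)
    except OverflowError:
        return False
    return True


# Largest value representable by typecode "i" on this platform
# (2**31 - 1 where a C int is 4 bytes wide).
c_int_max = 2 ** (8 * array("i").itemsize - 1) - 1

print(fits_in_c_int(c_int_max))      # True
print(fits_in_c_int(c_int_max + 1))  # False
```

If this is indeed the cause, any input whose running count of stored entries passes that limit would fail at the same line, regardless of the dtype passed to DictVectorizer.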
Versions
System:
    python: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
    executable: /home/xx/.conda/envs/seqfeaturizer/bin/python
    machine: Linux-3.10.0-693.11.6.el7.x86_64-x86_64-with-centos-7.4.1708-Core
BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
    lib_dirs: /home/xx/.conda/envs/seqfeaturizer/lib
    cblas_libs: mkl_rt, pthread
Python deps:
    pip: 18.1
    setuptools: 40.6.3
    sklearn: 0.20.2
    numpy: 1.15.4
    scipy: 1.1.0
    Cython: None
    pandas: 0.23.4